Abstract



In a multi-armed bandit problem, an online algorithm chooses from a set of strategies in a sequence
of n trials so as to maximize the total payoff of the chosen strategies. While the performance of bandit
algorithms with a small finite strategy set is quite well understood, bandit problems with large strategy sets
are still a topic of very active investigation, motivated by practical applications such as online auctions
and web advertisement. The goal of such research is to identify broad and natural classes of strategy sets
and payoff functions which enable the design of efficient solutions.

In this work we study a very general setting for the multi-armed bandit problem in which the strategies
form a metric space, and the payoff function satisfies a Lipschitz condition with respect to the metric.
We refer to this problem as the Lipschitz MAB problem. We present a solution for the multi-armed problem
in this setting. That is, for every metric space (L, X) we define an isometry invariant MaxMinCOV(X)
which bounds from below the performance of Lipschitz MAB algorithms for X, and we present an algorithm
which comes arbitrarily close to meeting this bound. Furthermore, our technique gives even better results
for benign payoff functions.

1 Introduction

In a multi-armed bandit problem, an online algorithm must choose from a set of strategies in a sequence
of n trials so as to maximize the total payoff of the chosen strategies. These problems are the principal
theoretical tool for modeling the exploration/exploitation tradeoffs inherent in sequential decision-making
under uncertainty. Studied intensively for the last three decades [13], bandit problems are having
an increasingly visible impact on computer science because of their diverse applications including online
auctions, adaptive routing, and the theory of learning in games. The performance of a multi-armed bandit
algorithm is often evaluated in terms of its regret, defined as the gap between the expected payoff of the
algorithm and that of an optimal strategy. While the performance of bandit algorithms with a small finite
strategy set is quite well understood, bandit problems with exponentially or infinitely large strategy sets are
still a topic of very active investigation [1, 2, 4, 9, 10, 11, 12, 15, 16].

Absent any assumptions about the strategies and their payoffs, bandit problems with large strategy sets
allow for no non-trivial solutions: any multi-armed bandit algorithm performs as badly, on some inputs,
as random guessing. But in most applications it is natural to assume a structured class of payoff functions,
which often enables the design of efficient learning algorithms fTF]. In this paper, we consider a broad and 
natural class of problems in which the structure is induced by a metric on the space of strategies. While
bandit problems have been studied in a few specific metric spaces (such as a one-dimensional interval)
[1, 4, 9, 15, 22], the case of general metric spaces has not been treated before, despite being an extremely
natural setting for bandit problems. As a motivating example, consider the problem faced by a website
choosing from a database of thousands of banner ads to display to users, with the aim of maximizing the
click-through rate of the ads displayed by matching ads to users' characterizations and the web content that
they are currently watching. Independently experimenting with each advertisement is infeasible, or at least
highly inefficient, since the number of ads is too large. Instead, the advertisements are usually organized into
a taxonomy based on metadata (such as the category of product being advertised) which allows a similarity
measure to be defined. The website can then attempt to optimize its learning algorithm by generalizing from
experiments with one ad to make inferences about the performance of similar ads [22, 23]. Abstractly, we
have a bandit problem of the following form: there is a strategy set $X$, with an unknown payoff function
$\mu : X \to [0, 1]$ satisfying a set of predefined constraints of the form $|\mu(u) - \mu(v)| \le \delta(u, v)$ for some
$u, v \in X$ and $\delta(u, v) > 0$. In each period the algorithm chooses a point $x \in X$ and observes an independent
random sample from a payoff distribution whose expectation is $\mu(x)$.

A moment's thought reveals that this abstract problem can be regarded as a bandit problem in a metric
space. Specifically, if $L(u, v)$ is defined to be the infimum, over all finite sequences $u = x_0, x_1, \ldots, x_k = v$
in $X$, of the quantity $\sum_i \delta(x_i, x_{i+1})$, then $L$ is a metric¹ and the constraints $|\mu(u) - \mu(v)| \le \delta(u, v)$ may be
summarized by stating that $\mu$ is a Lipschitz function (of Lipschitz constant 1) on the metric space $(L, X)$. We
refer to this problem as the Lipschitz MAB problem on $(L, X)$, and we refer to the ordered triple $(L, X, \mu)$
as an instance of the Lipschitz MAB problem.²
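
On a finite strategy set, the passage from pairwise constraints to the metric $L$ can be made concrete. The following is a minimal sketch, assuming the constraints are supplied as a dictionary over a finite list of strategies; the function name `induced_metric` and the example data are hypothetical and not part of the paper. It computes $L$ as all-pairs shortest-path distances using the Floyd-Warshall algorithm.

```python
import math

def induced_metric(points, delta):
    """Shortest-path (pseudo)metric induced by constraints delta[(u, v)] > 0.

    `points` is a finite list of strategies; `delta` maps pairs (u, v) to the
    bound on |mu(u) - mu(v)|. Unconstrained pairs start at distance infinity.
    """
    L = {(u, v): (0.0 if u == v else math.inf) for u in points for v in points}
    for (u, v), d in delta.items():
        # A constraint on |mu(u) - mu(v)| is symmetric in u and v.
        L[(u, v)] = min(L[(u, v)], d)
        L[(v, u)] = min(L[(v, u)], d)
    # Floyd-Warshall: allow paths through each intermediate point w in turn.
    for w in points:
        for u in points:
            for v in points:
                if L[(u, w)] + L[(w, v)] < L[(u, v)]:
                    L[(u, v)] = L[(u, w)] + L[(w, v)]
    return L

# Hypothetical example: the direct constraint 0.9 between "a" and "c" is
# weaker than the chained constraints 0.2 + 0.3 through "b".
points = ["a", "b", "c"]
delta = {("a", "b"): 0.2, ("b", "c"): 0.3, ("a", "c"): 0.9}
L = induced_metric(points, delta)
```

In this example the chained constraints through "b" imply a tighter bound on the pair ("a", "c") than the direct constraint, which is exactly what taking the infimum over finite sequences captures.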

Prior work. While our work is the first to treat the Lipschitz MAB problem in general metric spaces,
special cases of the problem are implicit in prior work on the continuum-armed bandit problem [1, 4, 9, 15],
which corresponds to the space $[0, 1]$ under the metric $L_d(x, y) = |x - y|^{1/d}$, $d \ge 1$, and the
experimental work on "bandits for taxonomies" [22], which corresponds to the case in which $(L, X)$ is a tree
metric. Before describing our results in greater detail, it is helpful to put them in context by recounting the
nearly optimal bounds for the one-dimensional continuum-armed bandit problem, a problem first formulated
by R. Agrawal in 1995 [1] and recently solved (up to logarithmic factors) by various authors [4, 9, 15]. In the
following theorem and throughout this paper, the regret of a multi-armed bandit algorithm $A$ running on an
instance $(L, X, \mu)$ is defined to be the function $R_A(t)$ which measures the difference between the quantity
$t \sup_{x \in X} \mu(x)$ and the expected payoff of $A$ at time $t$. The former quantity is the expected payoff of always
playing a strategy $x \in \arg\max \mu(x)$, if such a strategy exists.

Theorem 1.1 ([4, 9, 15]). For any $d \ge 1$, consider the Lipschitz MAB problem on $(L_d, [0, 1])$. There is an
algorithm $A$ whose regret on any instance $\mu$ satisfies $R_A(t) = \tilde{O}(t^{\gamma})$ for every $t$, where $\gamma = \frac{d+1}{d+2}$. No such
algorithm exists for any $\gamma < \frac{d+1}{d+2}$.

¹ More precisely, it is a pseudometric because some pairs of distinct points $x, y \in X$ may satisfy $L(x, y) = 0$.
² When the metric space $(L, X)$ is understood from context, we may also refer to $\mu$ as an instance.

In fact, if the time horizon $t$ is known in advance, the upper bound in the theorem can be achieved by
an extremely naive algorithm which simply uses an optimal $k$-armed bandit algorithm (such as the UCB1
algorithm [2]) to choose strategies from the set $S = \{0, \frac{1}{k}, \frac{2}{k}, \ldots, 1\}$, for a suitable choice of the
parameter $k$. While the regret bound in Theorem 1.1 is essentially optimal for the Lipschitz MAB problem in
$(L_d, [0, 1])$, it is strikingly odd that it is achieved by such a simple algorithm. In particular, the algorithm
approximates the strategy set by a fixed mesh $S$ and does not refine this mesh as it gains information about
the location of the optimal strategy. Moreover, the metric contains seemingly useful proximity information,
but the algorithm ignores this information after choosing its initial mesh. Is this really the best algorithm?
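
For concreteness, the naive algorithm can be sketched as follows for the case $d = 1$. This is only an illustration under stated assumptions: the index used is the standard UCB1 index (empirical mean plus $\sqrt{2 \ln t / n}$), the mesh size $k = \lceil T^{1/3} \rceil$ ignores logarithmic factors, and the names `naive_mesh_ucb1` and `bernoulli_peak` as well as the payoff function are hypothetical.

```python
import math
import random

def naive_mesh_ucb1(sample_payoff, horizon, k, rng):
    """Run UCB1 on the fixed mesh {0, 1/k, ..., 1}.

    `sample_payoff(x, rng)` returns a random reward in [0, 1] for strategy x.
    Returns the total reward and the number of plays of each mesh point.
    """
    mesh = [j / k for j in range(k + 1)]
    counts = [0] * len(mesh)
    sums = [0.0] * len(mesh)
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= len(mesh):
            arm = t - 1  # initialization: play every mesh point once
        else:
            # UCB1 index: empirical mean plus an exploration bonus.
            arm = max(
                range(len(mesh)),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = sample_payoff(mesh[arm], rng)
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total, counts

def bernoulli_peak(x, rng):
    """Hypothetical instance: Bernoulli reward with mean 1 - |x - 0.3|."""
    return 1.0 if rng.random() < 1.0 - abs(x - 0.3) else 0.0

horizon = 2000
k = math.ceil(horizon ** (1.0 / 3.0))  # assumed mesh size for d = 1
total, counts = naive_mesh_ucb1(bernoulli_peak, horizon, k, random.Random(0))
```

Note that the mesh is fixed before the first round and never changes, which is precisely the feature questioned in the text.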

A closer examination of the lower bound proof raises further reasons for suspicion: it is based on a
contrived, highly singular payoff function $\mu$ that alternates between being constant on some distance scales
and being very steep on other (much smaller) distance scales, to create a multi-scale "needle in haystack"
phenomenon which nearly obliterates the usefulness of the proximity information contained in the metric
$L_d$. Can we expect algorithms to do better when the payoff function is more benign? For the Lipschitz MAB
problem on $(L_1, [0, 1])$, the question was answered affirmatively in [9, 4] for some classes of instances, with
algorithms that are tuned to the specific classes.

Our results and techniques. In this paper we consider the Lipschitz MAB problem on arbitrary metric 
spaces. We are concerned with the following two main questions motivated by the discussion above: 

(i) What is the best possible bound on regret for a given metric space? 

(ii) Can one take advantage of benign payoff functions? 

In this paper we give a complete solution to (i), by describing for every metric space X a family of algorithms 
which come arbitrarily close to achieving the best possible regret bound for X. We also give a satisfactory 
answer to (ii); our solution is arbitrarily close to optimal in terms of the zooming dimension defined below. 
In fact, our algorithm for (i) is an extension of the algorithmic technique used to solve (ii). 

Our main technical contribution is a new algorithm, the zooming algorithm, that combines the upper 
confidence bound technique used in earlier bandit algorithms such as UCB1 with a novel adaptive refinement
step that uses past history to zoom in on regions near the apparent maxima of $\mu$ and to explore a denser mesh
of strategies in these regions. This algorithm is a key ingredient in our design of an optimal bandit algorithm
for every metric space $(L, X)$. Moreover, we show that the zooming algorithm can perform significantly
better on benign problem instances. That is, for every instance $(L, X, \mu)$ we define a parameter called the
zooming dimension, and use it to bound the algorithm's performance in a way that is often significantly 
stronger than the corresponding per-metric bound. Note that the zooming algorithm is self-tuning, i.e. it 
achieves this bound without requiring prior knowledge of the zooming dimension. 

To state our theorem on the per-metric optimal solution for (i), we need to sketch a few definitions which
arise naturally as one tries to extend the lower bound from [15] to general metric spaces. Let us say that a
subset $Y$ in a metric space $X$ has covering dimension $d$ if it can be covered by $O(\delta^{-d})$ sets of diameter $\delta$
for all $\delta > 0$. A point $x \in X$ has local covering dimension $d$ if it has an open neighborhood of covering
dimension $d$. The space $X$ has max-min-covering dimension $d = \mathrm{MaxMinCOV}(X)$ if it has no subspace
whose local covering dimension is uniformly bounded below by a number greater than $d$.

Theorem 1.2. Consider the Lipschitz MAB problem on a compact metric space $(L, X)$. Let $d = \mathrm{MaxMinCOV}(X)$.
If $\gamma > \frac{d+1}{d+2}$ then there exists a bandit algorithm $A$ such that for every problem instance $\mathcal{I}$ it satisfies
$R_A(t) = O_{\mathcal{I}}(t^{\gamma})$ for all $t$. No such algorithm exists if $d > 0$ and $\gamma < \frac{d+1}{d+2}$.

In general $\mathrm{MaxMinCOV}(X)$ is bounded above by the covering dimension of $X$. For metric spaces which
are highly homogeneous (in the sense that any two $\epsilon$-balls are isometric to one another) the two dimensions
are equal, and the upper bound in the theorem can be achieved using a generalization of the naive algorithm
described earlier. The difficulty in Theorem 1.2 lies in dealing with inhomogeneities in the metric space.³
It is important to treat the problem at this level of generality, because some of the most natural applications
of the Lipschitz MAB problem, e.g. the web advertising problem described earlier, are based on highly
inhomogeneous metric spaces. (That is, in web taxonomies, it is unreasonable to expect different categories
at the same level of a topic hierarchy to have roughly the same number of descendants.)

³ To appreciate this issue, it is very instructive to consider a concrete example of a metric space $(L, X)$ where
$\mathrm{MaxMinCOV}(X)$ is strictly less than the covering dimension, and for this specific example design a bandit
algorithm whose regret bounds are better than those suggested by the covering dimension. This is further
discussed in Section 3.

The algorithm in Theorem 1.2 combines the zooming algorithm described earlier with a delicate transfinite
construction over closed subsets consisting of "fat points" whose local covering dimension exceeds a
given threshold $d$. For the lower bound, we craft a new dimensionality notion, the max-min-covering
dimension introduced above, which captures the inhomogeneity of a metric space, and we connect this notion
with the transfinite construction that underlies the algorithm.

For "benign" input instances we provide a better performance guarantee for the zooming algorithm.
The lower bounds in Theorems 1.1 and 1.2 are based on contrived, highly singular, "needle in haystack"
instances in which the set of near-optimal strategies is astronomically larger than the set of precisely optimal
strategies. Accordingly, we quantify the tractability of a problem instance in terms of the number of
near-optimal strategies. We define the zooming dimension of an instance $(L, X, \mu)$ as the smallest $d$ such that the
following covering property holds: for every $\delta > 0$ we require only $O(\delta^{-d})$ sets of diameter $\delta/8$ to cover
the set of strategies whose payoff falls short of the maximum by an amount between $\delta$ and $2\delta$.

Theorem 1.3. If $d$ is the zooming dimension of a Lipschitz MAB instance then at any time $t$ the zooming
algorithm suffers regret $\tilde{O}(t^{\gamma})$, $\gamma = \frac{d+1}{d+2}$. Moreover, this is the best possible exponent $\gamma$ as a function of $d$.

The zooming dimension can be significantly smaller than the max-min-covering dimension. Let us
illustrate this point with two examples (where for simplicity the max-min-covering dimension is equal to
the covering dimension). For the first example, consider a metric space consisting of a high-dimensional
part and a low-dimensional part. For concreteness, consider a rooted tree $T$ with two top-level branches $T'$
and $T''$ which are complete infinite $k$-ary trees, with $k = 2$ and $k = 10$ respectively. Assign edge weights in $T$
that are exponentially decreasing with distance to the root, and let $L$ be the resulting shortest-path metric
on the leaf set $X$.⁴ If there is a unique optimal strategy that lies in the low-dimensional part $T'$ then the
zooming dimension is bounded above by the covering dimension of $T'$, whereas the "global" covering
dimension is that of $T''$. In the second example, let $(L, X)$ be a homogeneous high-dimensional metric, e.g. the
Euclidean metric on the unit $k$-cube, and let the payoff function be $\mu(x) = 1 - L(x, S)$ for some subset $S$.
Then the zooming dimension is equal to the covering dimension of $S$, e.g. it is 0 if $S$ is a finite point set.

Discussion. In stating the theorems above, we have been imprecise about specifying the model of com- 
putation. In particular, we have ignored the thorny issue of how to provide an algorithm with an input 
containing a metric space which may have an infinite number of points. The simplest way to interpret our 
theorems is to ignore implementation details and interpret "algorithm" to mean an abstract decision rule, i.e. 
a (possibly randomized) function mapping a history of past observations $(x_j, r_j) \in X \times [0, 1]$ to a strategy
$x \in X$ which is played in the current period. All of our theorems are valid under this interpretation, but
they can also be made into precise algorithmic results provided that the algorithm is given appropriate oracle 
access to the metric space. In most cases, our algorithms require only a covering oracle which takes a finite 
collection of open balls and either declares that they cover X or outputs an uncovered point. We refer to this 
setting as the standard Lipschitz MAB problem. For example, the zooming algorithm uses only a covering 
oracle for (L, X), and requires only one oracle query per round (with at most t balls in round t). However, 
the per-metric optimal algorithm in Theorem 1.2 uses more complicated oracles, and we defer the definition
of these oracles to Section 3.
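
When the strategy set is finite, a covering oracle of the kind described above can be realized by brute force. The following is a minimal sketch under that assumption; the function name `covering_oracle` and the representation of balls as (center, radius) pairs are hypothetical choices made for illustration.

```python
def covering_oracle(points, dist, balls):
    """Return a point of `points` not covered by any open ball, or None.

    `balls` is a finite list of (center, radius) pairs. A point x lies in the
    open ball (c, r) exactly when dist(c, x) < r.
    """
    for x in points:
        if all(dist(c, x) >= r for c, r in balls):
            return x  # x is outside every ball
    return None  # the balls cover the whole strategy set

# Hypothetical example on five points of the unit interval.
points = [0.0, 0.25, 0.5, 0.75, 1.0]
dist = lambda a, b: abs(a - b)
uncovered = covering_oracle(points, dist, [(0.0, 0.3), (1.0, 0.3)])
all_covered = covering_oracle(points, dist, [(0.5, 0.6)])
```

Here the first query leaves the midpoint uncovered, while the single larger ball in the second query covers every point.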

While our definitions and results so far have been tailored for the Lipschitz MAB problem on infinite
metrics, some of them can be extended to the finite case as well. In particular, for the zooming algorithm
we obtain sharp results (that are meaningful for both finite and infinite metrics) using a more precise,
non-asymptotic version of the zooming dimension. Extending the notions in Theorem 1.2 to the finite case is an
open question.

⁴ Here a leaf is defined as an infinite path away from the root.

Extensions. We provide a number of extensions in which we elaborate on our analysis of the zooming 
algorithm. First, we provide sharper bounds for several examples in which the reward from playing each 
strategy $u$ is $\mu(u)$ plus an independent noise of a known and "benign" shape. Second, we upgrade the
zooming algorithm so that it satisfies the guarantee in Theorem 1.3 and enjoys a better guarantee if the
maximal reward is exactly 1. Third, we apply this result to a version where $\mu(\cdot) = 1 - L(\cdot, S)$ for some
target set $S$ which is not revealed to the algorithm. Fourth, we relax some assumptions in the analysis of
the zooming algorithm, and use this generalization to analyze the version in which $\mu(\cdot) = 1 - f(L(\cdot, S))$
for some known function $f$. Finally, we extend our analysis from reward distributions supported on $[0, 1]$ to
those with unbounded support and finite absolute third moment. 

Follow-up work. For metric spaces whose max-min-covering dimension is exactly 0, this paper provides
an upper bound $R(T) = O_{\mathcal{I}}(T^{\gamma})$ for any $\gamma > \frac{1}{2}$, but no matching lower bound. Characterizing the optimal
regret for such metric spaces remained an open question. Following the publication of the conference
version, this question has been settled in ifTTl . revealing the following dichotomy: for every metric space, the 
optimal regret of a Lipschitz MAB algorithm is either bounded above by any $f \in \omega(\log t)$, or bounded below
by any $g \in o(\sqrt{t})$, depending on whether the completion of the metric space is compact and countable.

1.1 Preliminaries 

Given a metric space, $B(x, r)$ denotes an open ball of radius $r$ around point $x$. Throughout the paper, the
constants in the $O(\cdot)$ notation are absolute unless specified otherwise.

Definition 1.4. In the Lipschitz MAB problem on $(L, X)$, there is a strategy set $X$, a metric space $(L, X)$ of
diameter $\le 1$, and a payoff function $\mu : X \to [0, 1]$ such that the following Lipschitz condition holds:

    $|\mu(x) - \mu(y)| \le L(x, y)$ for all $x, y \in X$.    (1)

Call $L$ the similarity function. The metric space $(L, X)$ is revealed to an algorithm, whereas the payoff
function $\mu$ is not. In each round the algorithm chooses a strategy $x \in X$ and observes an independent
random sample from a payoff distribution with support $S \subseteq [0, 1]$ and expectation $\mu(x)$.

The regret of a bandit algorithm $A$ running on a given problem instance is $R_A(t) = t\mu^* - W_A(t)$, where
$W_A(t)$ is the expected total payoff of $A$ up to time $t$ and $\mu^* = \sup_{x \in X} \mu(x)$ is the maximal expected reward.

The $C$-zooming dimension of the problem instance $(L, X, \mu)$ is the smallest $d$ such that for every
$r \in (0, 1]$ the set $X_r = \{x \in X : \frac{r}{2} < \mu^* - \mu(x) \le r\}$ can be covered by $C\, r^{-d}$ sets of diameter at most $r/8$.
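
As a worked illustration of this definition (an example chosen here for concreteness, not taken from the paper), take $X = [0, 1]$ with $L(x, y) = |x - y|$ and $\mu(x) = 1 - |x - x_0|$ for a fixed $x_0 \in [0, 1]$, so that $\mu^* = 1$:

```latex
\begin{align*}
X_r &= \{ x \in [0,1] : \tfrac{r}{2} < |x - x_0| \le r \}
     \subseteq [x_0 - r,\ x_0 - \tfrac{r}{2}) \,\cup\, (x_0 + \tfrac{r}{2},\ x_0 + r], \\
&\text{each of the two intervals has length } \tfrac{r}{2}
     \text{ and is covered by } 4 \text{ intervals of diameter } \tfrac{r}{8}, \\
&\text{so } X_r \text{ is covered by } 8 = 8\, r^{-0} \text{ sets of diameter at most } \tfrac{r}{8}.
\end{align*}
```

Hence the 8-zooming dimension of this instance is 0, even though the covering dimension of $[0, 1]$ is 1.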

Definition 1.5. Fix a metric space on set $X$. Let $N(r)$ be the smallest number of sets of diameter $r$ required
to cover $X$. The covering dimension of $X$ is

    $\mathrm{COV}(X) = \inf \{ d : \exists c \ \forall r > 0 \quad N(r) \le c\, r^{-d} \}$.

The $c$-covering dimension of $X$ is defined as the infimum of all $d$ such that $N(r) \le c\, r^{-d}$ for all $r > 0$.
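
The growth rate of $N(r)$ can be explored numerically on a finite sample of a metric space. The sketch below is a heuristic illustration only: a finite set always has covering dimension 0 in the asymptotic sense of the definition, so the code merely estimates the slope of the covering numbers between two scales. The greedy net stands in for an optimal cover, and the names `greedy_net` and `dimension_estimate` are hypothetical.

```python
import math

def greedy_net(points, dist, r):
    """Greedily pick centers pairwise at distance >= r.

    By construction every point is within distance < r of some center, so
    the balls of radius r around the centers cover `points`.
    """
    centers = []
    for x in points:
        if all(dist(x, c) >= r for c in centers):
            centers.append(x)
    return centers

def dimension_estimate(points, dist, r_coarse, r_fine):
    """Slope of log(net size) against log(1/r) between two scales."""
    n_coarse = len(greedy_net(points, dist, r_coarse))
    n_fine = len(greedy_net(points, dist, r_fine))
    return math.log(n_fine / n_coarse) / math.log(r_coarse / r_fine)

# Hypothetical example: a fine grid on [0, 1] with dyadic scales, so that
# all distances are exact in floating point.
points = [j / 1024 for j in range(1025)]
dist = lambda a, b: abs(a - b)
net_coarse = greedy_net(points, dist, 1 / 8)
net_fine = greedy_net(points, dist, 1 / 64)
slope = dimension_estimate(points, dist, 1 / 8, 1 / 64)
```

On this grid the estimated slope is close to 1, in line with the covering dimension of the unit interval.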

Outline of the paper. In Section 2 we prove Theorem 1.3. In Section 3 we discuss the per-metric optimality
and prove Theorem 1.2. Section 4 covers the extensions.
2 Adaptive exploration: the zooming algorithm



In this section we introduce the zooming algorithm which uses adaptive exploration to take advantage of the
"benign" input instances, and prove the main guarantee (Theorem 1.3).

Consider the standard Lipschitz MAB problem on $(L, X)$. The zooming algorithm proceeds in phases
$i = 1, 2, 3, \ldots$ of $2^i$ rounds each. Let us consider a single phase $i_{ph}$ of the algorithm. For each strategy
$v \in X$ and time $t$, let $n_t(v)$ be the number of times this strategy has been played in this phase before time $t$,
and let $\mu_t(v)$ be the corresponding average reward. Define $\mu_t(v) = 0$ if $n_t(v) = 0$. Note that at time $t$ both
quantities are known to the algorithm. Define the confidence radius of $v$ at time $t$ as

    $r_t(v) := \sqrt{\frac{8\, i_{ph}}{2 + n_t(v)}}$.    (2)

Let $\mu(v)$ be the expected reward of strategy $v$. Note that $E[\mu_t(v)] = \mu(v)$. Using Chernoff bounds, we
can bound $|\mu_t(v) - \mu(v)|$ in terms of the confidence radius:

Definition 2.1. A phase is called clean if for each strategy $v \in X$ that has been played at least once during
this phase and each time $t$ we have $|\mu_t(v) - \mu(v)| \le r_t(v)$.

Claim 2.2. Phase $i_{ph}$ is clean with probability at least $1 - 4^{-i_{ph}}$.

Throughout the execution of the algorithm, a finite number of strategies are designated active. Our
algorithm only plays active strategies, among which it chooses a strategy $v$ with the maximal index

    $I_t(v) = \mu_t(v) + 2\, r_t(v)$.    (3)

Say that strategy $v$ covers strategy $u$ at time $t$ if $u \in B(v, r_t(v))$. Say that a strategy $u$ is covered at time $t$ if
at this time it is covered by some active strategy $v$. Note that the covering oracle (as defined in Section 1) can
return a strategy which is not covered if such a strategy exists, or else inform the algorithm that all strategies
are covered. Now we are ready to state the algorithm:

Algorithm 2.3 (Zooming Algorithm). Each phase $i$ runs for $2^i$ rounds. In the beginning of the phase no
strategies are active. In each round do the following:

1. If some strategy is not covered, make it active.

2. Play an active strategy with the maximal index (3); break ties arbitrarily.
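
The two steps above can be sketched in code for a finite strategy set. This is an illustrative sketch rather than a definitive implementation: it assumes a finite list of points with diameter at most 1, replaces the covering oracle by a brute-force scan, and uses a hypothetical Bernoulli instance; the names `zooming_phase` and `bernoulli_peak` are not from the paper.

```python
import math
import random

def zooming_phase(points, dist, sample_payoff, i_ph, rng):
    """Run one phase (2**i_ph rounds) of the zooming algorithm on a finite set.

    Assumes the metric has diameter at most 1, as in Definition 1.4.
    Returns the list of (strategy, reward) pairs played in this phase.
    """
    plays = {}    # active strategy -> number of plays in this phase
    rewards = {}  # active strategy -> sum of observed rewards in this phase

    def radius(v):
        # Confidence radius (2).
        return math.sqrt(8.0 * i_ph / (2.0 + plays[v]))

    def index(v):
        # Index (3); the average reward is taken to be 0 before the first play.
        mean = rewards[v] / plays[v] if plays[v] > 0 else 0.0
        return mean + 2.0 * radius(v)

    history = []
    for _ in range(2 ** i_ph):
        # Step 1: brute-force stand-in for the covering oracle. A newly
        # activated strategy has radius >= 2, so one activation suffices.
        for u in points:
            if all(dist(u, v) >= radius(v) for v in plays):
                plays[u], rewards[u] = 0, 0.0
                break
        # Step 2: play an active strategy with the maximal index.
        v = max(plays, key=index)
        r = sample_payoff(v, rng)
        plays[v] += 1
        rewards[v] += r
        history.append((v, r))
    return history

def bernoulli_peak(x, rng):
    """Hypothetical instance: Bernoulli reward with mean 1 - |x - 0.3|."""
    return 1.0 if rng.random() < 1.0 - abs(x - 0.3) else 0.0

points = [j / 50 for j in range(51)]
history = zooming_phase(points, lambda a, b: abs(a - b), bernoulli_peak, 8,
                        random.Random(1))
```

Calling `zooming_phase` for $i_{ph} = 1, 2, 3, \ldots$ in turn, each time starting with no active strategies, gives the full multi-phase procedure.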

We formulate the main result of this section as follows:

Theorem 2.4. Consider the standard Lipschitz MAB problem. Let $A$ be Algorithm 2.3. Then for every $C > 0$

    $R_A(t) \le O(C \log t)^{1/(2+d)} \times t^{1 - 1/(2+d)}$ for all $t$,    (4)

where $d$ is the $C$-zooming dimension of the problem instance.

Remark. The zooming algorithm is not parameterized by the $C$ in (4), yet satisfies (4) for all $C > 0$. For
sharper guarantees, $C$ can be tuned to the specific problem instance and specific time $t$.

Let us prove Theorem 2.4. Note that after step 1 in Algorithm 2.3 all strategies are covered. (Indeed,
if some strategy is activated in step 1 then its confidence radius is $\sqrt{4\, i_{ph}} \ge 2$, which exceeds the diameter
of the metric space, so it covers the entire metric.) Let $\mu^* = \sup_{u \in X} \mu(u)$ be the
maximal expected reward; note that we do not assume that the supremum is achieved by some strategy. Let
$\Delta(v) = \mu^* - \mu(v)$. Let us focus on a given phase $i_{ph}$ of the algorithm.






Lemma 2.5. If phase $i_{ph}$ is clean then we have $\Delta(v) \le 4\, r_t(v)$ for any time $t$ and any strategy $v$. It follows
that $n_t(v) \le O(i_{ph})\, \Delta^{-2}(v)$.

Proof. Suppose strategy $v$ is played at time $t$. First we claim that $I_t(v) \ge \mu^*$. Indeed, fix $\epsilon > 0$. By
definition of $\mu^*$ there exists a strategy $v^*$ such that $\Delta(v^*) < \epsilon$. Let $v_t$ be an active strategy that covers $v^*$.
By the algorithm specification $I_t(v) \ge I_t(v_t)$. Since the phase is clean, by definition of index we have
$I_t(v_t) \ge \mu(v_t) + r_t(v_t)$. By the Lipschitz property we have $\mu(v_t) \ge \mu(v^*) - L(v_t, v^*)$. Since $v_t$ covers $v^*$,
we have $L(v_t, v^*) \le r_t(v_t)$. Putting all these inequalities together, we have $I_t(v) \ge \mu(v^*) \ge \mu^* - \epsilon$. Since
this inequality holds for an arbitrary $\epsilon > 0$, we in fact have $I_t(v) \ge \mu^*$. Claim proved.

Furthermore, note that by the definitions of "clean phase" and "index" we have $\mu^* \le I_t(v) \le \mu(v) + 3\, r_t(v)$
and therefore $\Delta(v) \le 3\, r_t(v)$.

Now suppose strategy $v$ is not played at time $t$. If it has never been played before time $t$ in this phase,
then $r_t(v) > 1$ and thus the lemma is trivial. Else, let $s$ be the last time strategy $v$ has been played before
time $t$. Then by definition of the confidence radius $r_t(v) = r_{s+1}(v) \ge \sqrt{2/3}\; r_s(v) \ge \frac{1}{4} \Delta(v)$. □

Corollary 2.6. In a clean phase, for any active strategies $u, v$ we have $L(u, v) \ge \frac{1}{4} \min(\Delta(u), \Delta(v))$.

Proof. Assume $u$ has been activated before $v$. Let $s$ be the time when $v$ has been activated. Then by the
algorithm specification we have $L(u, v) \ge r_s(u)$. By Lemma 2.5, $r_s(u) \ge \frac{1}{4} \Delta(u)$. □

Let $d$ be the $C$-zooming dimension. For a given time $t$ in the current phase, let $S(t)$ be the set of all
strategies that are active at time $t$, and let

    $A(i, t) = \{ v \in S(t) : 2^i \le \Delta^{-1}(v) < 2^{i+1} \}$.

We claim that $|A(i, t)| \le C\, 2^{id}$. Indeed, set $A(i, t)$ can be covered by $C\, 2^{id}$ sets of diameter at most $2^{-i}/8$;
by Corollary 2.6 each of these sets contains at most one strategy from $A(i, t)$.

Claim 2.7. In a clean phase $i_{ph}$, for each time $t$ we have

    $\sum_{v \in S(t)} \Delta(v)\, n_t(v) \le O(C\, i_{ph})^{1-\gamma}\, t^{\gamma}$,    (5)

where $\gamma = \frac{d+1}{d+2}$ and $d$ is the $C$-zooming dimension.

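Proof. Fix the time horizon $t$. For a subset $S \subset X$ of strategies, let $R_S = \sum_{v \in S} \Delta(v)\, n_t(v)$. Let us choose
$\rho \in (0, 1)$ such that

    $\rho\, t = (1/\rho)^{d+1}\, (C\, i_{ph}) = t^{\gamma}\, (C\, i_{ph})^{1-\gamma}$.

Define $B$ as the set of all strategies $v \in S(t)$ such that $\Delta(v) \le \rho$. Recall that by Lemma 2.5 for each
$v \in A(i, t)$ we have $n_t(v) \le O(i_{ph})\, \Delta^{-2}(v)$. Then

    $R_{A(i,t)} \le O(i_{ph}) \sum_{v \in A(i,t)} \Delta^{-1}(v)$
        $\le O(2^i\, i_{ph})\, |A(i, t)|$
        $\le O(C\, i_{ph})\, 2^{i(d+1)}$,

    $\sum_{v \in S(t)} \Delta(v)\, n_t(v) \le R_B + \sum_{i < \log(1/\rho)} R_{A(i,t)}$
        $\le \rho\, t + O(C\, i_{ph})\, (1/\rho)^{d+1}$
        $\le O(t^{\gamma}\, (C\, i_{ph})^{1-\gamma})$. □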
The left-hand side of (5) is essentially the contribution of the current phase to the overall regret. It
remains to sum these contributions over all past phases.

Proof of Theorem 2.4. Let $i_{ph}$ be the current phase, let $t$ be the time spent in this phase, and let $T$ be the
total time since the beginning of phase 1. Let $R_{ph}(i_{ph}, t)$ be the left-hand side of (5). Combining Claim 2.2
and Claim 2.7, we have

    $E[R_{ph}(i_{ph}, t)] \le O(C\, i_{ph})^{1-\gamma}\, t^{\gamma}$,

    $R_A(T) = E\left[ R_{ph}(i_{ph}, t) + \sum_{i=1}^{i_{ph}-1} R_{ph}(i, 2^i) \right] \le O(C \log T)^{1-\gamma}\, T^{\gamma}$. □

3 Attaining the optimal per-metric performance 

In this section we ask, "What is the best possible algorithm for the Lipschitz MAB problem on a given 
metric space?" We consider the per-metric performance, which we define as the worst-case performance of 
a given algorithm over all possible problem instances on a given metric. As everywhere else in this paper, 
we focus on minimizing the exponent $\gamma$ such that $R_A(t) \le t^{\gamma}$ for all sufficiently large $t$. Motivated by the
shape of the guarantees in Theorem 1.1 let us define the regret dimension of an algorithm as follows.

Definition 3.1. Consider the Lipschitz MAB problem on a given metric space. For algorithm $A$ and problem
instance $\mathcal{I}$ let

    $\mathrm{DIM}_{\mathcal{I}}(A) = \inf \{ d \ge 0 : \exists t_0 \ \forall t \ge t_0 \quad R_A(t) \le t^{1 - 1/(d+2)} \}$.

The regret dimension of $A$ is $\mathrm{DIM}(A) = \sup_{\mathcal{I}} \mathrm{DIM}_{\mathcal{I}}(A)$, where the supremum is taken over all problem
instances $\mathcal{I}$ on the given metric space.

Then Theorem 1.1 states that for the Lipschitz MAB problem on $(L_d, [0, 1])$, the regret dimension of
the "naive algorithm" is at most $d$. In fact, it is easy to extend the "naive algorithm" to arbitrary metric
spaces. Such an algorithm is parameterized by the covering dimension $d$ of the metric space. It divides time
into phases of exponentially increasing length, chooses a $\delta$-net during each phase,⁵ and runs a $K$-armed
bandit algorithm such as UCB1 on the elements of the $\delta$-net. The parameter $\delta$ is tuned optimally given $d$ and
the phase length $T$; the optimal value turns out to be $\delta = T^{-1/(d+2)}$. Using the technique from [15] it is
easy to prove that the regret dimension of this algorithm is at most $d$.

Lemma 3.2. Consider the Lipschitz MAB problem on a metric space $(L, X)$ of covering dimension $d$. Let
$A$ be the naive algorithm that uses UCB1 in each phase. Then $\mathrm{DIM}(A) \le d$.

Proof. Let $A$ be the naive algorithm. For concreteness, assume each phase $i$ lasts $2^i$ rounds. By definition
of the covering dimension, it suffices to assume that $d$ is a $c$-covering dimension, for some constant $c > 0$.
By definition of the regret dimension, it suffices to prove that $R_A(t) \le \tilde{O}(t^{\gamma})$ for all $t$, where $\gamma = \frac{d+1}{d+2}$. In
order to prove that, it suffices to show that for each phase $i$ we have $R_{(A,i)}(2^i) \le \tilde{O}(2^{i\gamma})$, where $R_{(A,i)}(t)$
is the expected regret accumulated in the first $t$ rounds of phase $i$.

⁵ It is easy to see that the cardinality of this $\delta$-net is $K = O(\delta^{-d})$.

Let us focus on some phase $i$. In this phase the algorithm chooses a $\delta$-net, call it $S$. We claim that
$|S| \le c\, \delta^{-d}$. Indeed, for any $\delta' < \delta$ the metric space can be covered by $c\, (\delta')^{-d}$ sets of diameter at most $\delta'$,
each of which can contain only one point from $S$; letting $\delta' \to \delta$ proves the claim. The algorithm proceeds
to run UCB1 on the elements of $S$. By [2] the expected regret of UCB1 on $K$ arms in $t$ rounds is at most
$O(\sqrt{K\, t \log t})$. Since the maximal $\mu$ on $S$ is at most $\delta$ off of the maximal $\mu$ on $X$, we have

    $R_{(A,i)}(t) \le O(\sqrt{|S|\, t \log t}) + \delta t \le O(\sqrt{c\, \delta^{-d}\, t \log t} + \delta t)$.

Plugging in $t = 2^i$ and $\delta = t^{-1/(d+2)}$, we obtain $R_{(A,i)}(2^i) \le \tilde{O}(2^{i\gamma})$ as claimed. □
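
The last substitution can be checked term by term. With $\delta = t^{-1/(d+2)}$, the two terms of the bound balance (up to the logarithmic factor), which is what makes this choice of $\delta$ natural:

```latex
\begin{align*}
\sqrt{c\,\delta^{-d}\, t \log t}\;\Big|_{\delta = t^{-1/(d+2)}}
  &= \sqrt{c\, t^{\frac{d}{d+2} + 1} \log t}
   = \sqrt{c \log t}\;\, t^{\frac{d+1}{d+2}}, \\
\delta\, t\;\Big|_{\delta = t^{-1/(d+2)}}
  &= t^{1 - \frac{1}{d+2}}
   = t^{\frac{d+1}{d+2}}.
\end{align*}
```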

Thus we ask: is it possible to achieve a better regret dimension, perhaps using a more sophisticated 
algorithm? We show that this is indeed the case. Moreover, we provide an algorithm such that for any given 
metric space its regret dimension is arbitrarily close to optimal. 

The rest of this section is organized as follows. In Section 3.1 we develop a lower bound on regret
dimension. In Section 3.2 we will show that for some metric spaces, there exist algorithms whose regret
dimension is smaller than the covering dimension. We develop these ideas further in Section 3.3 and provide
an algorithm whose regret dimension is arbitrarily close to optimal.

3.1 Lower bound on regret dimension 

Let us develop a lower bound on regret dimension of any algorithm on a given metric space. This bound is 
equal to the covering dimension for highly homogeneous metric spaces (such as those in which all balls of 
a given radius are isometric to each other), but in general it can be much smaller. 

It is known [3] that a worst-case instance of the $K$-armed bandit problem consists of $K - 1$ strategies
with identical payoff distributions, and one which is slightly better. We refer to this as a "needle-in-haystack"
instance. The known constructions of lower bounds for Lipschitz MAB problems rely on creating a multi-scale
needle-in-haystack instance in which there are $K$ disjoint open sets, and $K - 1$ of them consist of
strategies with identical payoff distributions, but in the remaining open set there are strategies whose payoff
is slightly better. Moreover, this special open set contains $K' \gg K$ disjoint subsets, only one of which
contains strategies superior to the others, and so on down through infinitely many levels of recursion. To
ensure that this construction can be continued indefinitely, one needs to assume a covering property which 
ensures that each of the open sets arising in the construction has sufficiently many disjoint subsets to continue 
to the next level of recursion. 

Definition 3.3. For a metric space $(L, X)$, we say that $d$ is the min-covering dimension of $X$, $d = \mathrm{MinCOV}(X)$,
if $d$ is the infimum of $\mathrm{COV}(U)$ over all non-empty open subsets $U \subseteq X$. The max-min-covering dimension
of $X$ is defined by

    $\mathrm{MaxMinCOV}(X) = \sup \{ \mathrm{MinCOV}(Y) : Y \subseteq X \}$.

The infimum over open $U \subseteq X$ in the definition of min-covering dimension ensures that every open
set which may arise in the needle-in-haystack construction described above will contain $\Omega(\delta^{\epsilon - d})$ disjoint
$\delta$-balls for some sufficiently small $\delta, \epsilon$. Constructing lower bounds for Lipschitz MAB algorithms in a metric
space $X$ only requires that $X$ should have subsets with large min-covering dimension, which explains the
supremum over subsets in the definition of max-min-covering dimension.

We will use the following simple packing lemma (a folklore result; we provide the proof for convenience).

Lemma 3.4. If $Y$ is a metric space of covering dimension $d$, then for any $b < d$ and $r_0 > 0$, there exists
$r \in (0, r_0)$ such that $Y$ contains a collection of at least $r^{-b}$ disjoint open balls of radius $r$.

Proof. Let $r < r_0$ be a positive number such that every covering of $Y$ requires more than $r^{-b}$ balls of
radius $2r$. Such an $r$ exists, because the covering dimension of $Y$ is strictly greater than $b$. Now let
$\mathcal{P} = \{B_1, B_2, \ldots, B_M\}$ be any maximal collection of disjoint $r$-balls. For every $y \in Y$
there must exist some ball $B_i$ ($1 \le i \le M$) whose center is within distance $2r$ of $y$, as otherwise
$B(y, r)$ would be disjoint from every element of $\mathcal{P}$, contradicting the maximality of that collection.
If we enlarge each ball $B_i$ to a ball $B_i^+$ of radius $2r$, then every $y \in Y$ is contained in one of the
balls $\{B_i^+ \mid 1 \le i \le M\}$, i.e. they form a covering of $Y$. Hence $M \ge r^{-b}$ as desired. □
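
The packing-to-covering step in this proof can be illustrated on a finite point set. The following is a minimal sketch, not part of the original argument: the function names `greedy_packing` and `covers` are hypothetical, and the metric space is replaced by a finite list of points with a user-supplied distance function. The greedy rule accepts a new center only if it is at distance at least $2r$ from all earlier centers, which guarantees that the open $r$-balls are pairwise disjoint.

```python
def greedy_packing(points, r, dist):
    """Greedily pick centers whose open r-balls are pairwise disjoint.

    A point is accepted only if it is at distance >= 2r from every center
    chosen so far; by the triangle inequality this makes the r-balls disjoint.
    """
    centers = []
    for p in points:
        if all(dist(p, c) >= 2 * r for c in centers):
            centers.append(p)
    return centers


def covers(points, centers, radius, dist):
    """Check that every point lies in some open ball of the given radius."""
    return all(any(dist(p, c) < radius for c in centers) for p in points)


# Illustration on the integer points 0..100 with the usual distance.
points = list(range(101))
dist = lambda x, y: abs(x - y)
r = 5
centers = greedy_packing(points, r, dist)
```

Because every rejected point is within distance less than $2r$ of some accepted center, enlarging the balls to radius $2r$ covers all points, exactly as in the proof.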

Theorem 3.5. If $X$ is a metric space and $d$ is the max-min-covering dimension of $X$ then
$\mathrm{DIM}(\mathcal{A}) \ge d$ for every bandit algorithm $\mathcal{A}$.

Proof. Without loss of generality let us assume that $d > 0$. Given $\gamma < 1 - 1/(d+2)$, let $a < b < c < d$ be
such that $\gamma < 1 - 1/(a+2)$. Let $Y$ be a subset of $X$ such that $\mathrm{MinCOV}(Y) > c$. Using Lemma 3.4
we recursively construct an infinite sequence of sets $\mathcal{P}_0, \mathcal{P}_1, \ldots$ each consisting of
finitely many disjoint open balls in $X$, centered at points of $Y$. Let $\mathcal{P}_0 = \{X\}$ consist of a
single ball that contains all of $X$. If $i > 0$, for every ball $B \in \mathcal{P}_{i-1}$, let $r$ denote the
radius of $B$ and choose a number $r_i(B) \in (0, r/4)$ such that $B$ contains
$n_i(B) = \lceil r_i(B)^{-b} \rceil$ disjoint balls of radius $r_i(B)$ centered at points of $Y$. Such a
collection of disjoint balls exists by Lemma 3.4. Let $\mathcal{P}_i(B)$ denote this collection of disjoint balls
and let $\mathcal{P}_i = \bigcup_{B \in \mathcal{P}_{i-1}} \mathcal{P}_i(B)$. Now sample a random sequence of
balls $B_1, B_2, \ldots$ by picking $B_1 \in \mathcal{P}_1$ uniformly at random, and for $i > 1$
picking $B_i \in \mathcal{P}_i(B_{i-1})$ uniformly at random.

Given a ball $B = B(x^*, r^*)$, let $f_B$ be a Lipschitz function on $X$ defined by

$$f_B(x) = \begin{cases} \min\{r^* - L(x, x^*),\ r^*/2\} & \text{if } x \in B \\ 0 & \text{otherwise.} \end{cases}$$

Let $f_i = f_{B_i}$ for $i \ge 1$. Define $f_0$ by setting $f_0(x) = 1/3$ for all $x \in X$. The reader may verify
that the sum $\mu = \sum_{i=0}^{\infty} f_i$ is a Lipschitz function. Define the payoff distribution for
$x \in X$ to be a Bernoulli random variable with expectation $\mu(x)$. We have thus specified a randomized
construction of an instance $(L, X, \mu)$. We claim that for any algorithm $\mathcal{A}$ and any constant $C$,

$$\Pr_{\mu, \mathcal{A}}\big(\forall t\ \ R_{\mathcal{A}}(t) < C t^{\gamma}\big) = 0. \qquad (7)$$

The proof of this claim is based on a "needle in haystack" lemma (Lemma 3.6 below) which states that for
all $i$, conditional on the sequence $B_1, \ldots, B_{i-1}$, with probability at least
$1 - O((r_i(B_i))^{(b-a)/2})$, no more than half of the first $t_i(B_i) = r_i(B_i)^{-a-2}$ strategies picked
by $\mathcal{A}$ lie inside $B_i$. The proof of the lemma is deferred to the end of this section.

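Any strategy $x \notin B_i$ satisfies $\mu(x) \le \mu(x^*) - r_i/2$, so whenever at least half of the first
$t_i(B_i)$ strategies lie outside $B_i$ the regret is at least $\tfrac14\, t_i(B_i)\, r_i(B_i)$, and we may
conclude that

$$\Pr\big(R_{\mathcal{A}}(t_i(B_i)) < \tfrac14\, r_i(B_i)^{-a-1} \,\big|\, B_1, \ldots, B_{i-1}\big) < O\big((r_i(B_i))^{(b-a)/2}\big). \qquad (8)$$

Denoting $r_i(B_i)$ and $t_i(B_i)$ by $r_i$ and $t_i$, respectively, we have
$\tfrac14 r_i^{-a-1} = \tfrac14 t_i^{1 - 1/(a+2)} > C t_i^{\gamma}$ for all sufficiently large $i$. As $i$ runs
through the positive integers, the terms on the right side of (8) are dominated by a geometric progression
because $r_i(B_i) \le 4^{-i}$. By the Borel-Cantelli Lemma, almost surely there are
only finitely many $i$ such that the events on the left side of (8) occur. Thus (7) follows. □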
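
To make the nested "needle in haystack" payoff function concrete, here is a minimal sketch on the interval $[0,1]$ with $L(x,y) = |x-y|$. It is an illustration under simplifying assumptions that are not in the original: the names `bump`, `sample_nested_balls` and `mu` are hypothetical, only finitely many levels are generated, each level offers just three candidate balls, and every child ball is placed inside the plateau of its parent (within distance $r/2$ of the parent's center) so that the sum is visibly 1-Lipschitz.

```python
import random


def bump(x, center, radius):
    """f_B(x) = min(r - |x - c|, r/2) inside the open ball B(c, r), else 0."""
    d = abs(x - center)
    return min(radius - d, radius / 2) if d < radius else 0.0


def sample_nested_balls(depth, seed=0):
    """Sample a nested sequence of balls B_1, B_2, ..., each chosen uniformly
    among three disjoint candidate sub-balls of the previous ball."""
    rng = random.Random(seed)
    center, radius = 0.5, 0.5  # stand-in for the single ball containing X
    balls = []
    for _ in range(depth):
        child_radius = radius / 8  # smaller than radius / 4, as required
        # Three disjoint candidates, all inside the parent's plateau.
        offsets = [-2 * child_radius, 0.0, 2 * child_radius]
        center = center + rng.choice(offsets)
        radius = child_radius
        balls.append((center, radius))
    return balls


def mu(x, balls):
    """Expected payoff: the constant f_0 = 1/3 plus one bump per level."""
    return 1 / 3 + sum(bump(x, c, r) for c, r in balls)


balls = sample_nested_balls(depth=5, seed=1)
```

In this sketch any point outside the first ball has payoff exactly $1/3$, while the center of the innermost ball is better by at least $r_1/2$, mirroring the gap used in the proof.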



Remark. To prove Theorem 3.5 it suffices to show that for every given algorithm there exists a "hard"
problem instance. In fact we proved a stronger result: essentially, we construct a probability distribution
over problem instances which is hard, almost surely, for every given algorithm. This seems to be the best
possible bound since, obviously, a single problem instance cannot be hard for every algorithm.

In the rest of this subsection we prove the "needle in haystack" lemma used in the proof of Theorem 3.5.

Lemma 3.6. Consider the randomized construction of an instance $(L, X, \mu)$ in the proof of Theorem 3.5.
Fix a bandit algorithm $\mathcal{A}$. Then for all $i$, conditional on the sequence $B_1, \ldots, B_{i-1}$, with
probability at least $1 - O((r_i(B_i))^{(b-a)/2})$, no more than half of the first $r_i(B_i)^{-a-2}$ strategies
picked by $\mathcal{A}$ lie inside $B_i$.

Let us introduce some notation needed to prove the lemma. Let us fix an arbitrary Lipschitz MAB algo- 
rithm A. We will assume that A is deterministic; the corresponding result for randomized algorithms follows 
by conditioning on the algorithm's random bits (so that its behavior, conditional on these bits, is determinis- 
tic), invoking the lemma for deterministic algorithms, and then removing the conditioning by averaging over 
the distribution of random bits. Note that since our construction uses only $\{0, 1\}$-valued payoffs, and the
algorithm $\mathcal{A}$ is deterministic, the entire history of play in the first $t$ rounds can be summarized by
a binary vector $\sigma \in \{0, 1\}^t$, consisting of the payoffs observed by $\mathcal{A}$ in the first $t$
rounds. Thus a payoff function $\mu$ determines a probability distribution $P_\mu$ on the set $\{0, 1\}^t$, i.e.
the distribution on $t$-step histories realized when using algorithm $\mathcal{A}$ on instance $\mu$.

Let $B$ be any ball in the set $\mathcal{P}_{i-1}$, let $n = n_i(B)$, $r = r_i(B)$, and
$t = t_i(B) = r_i(B)^{-a-2}$. Let $B^1, B^2, \ldots, B^n$ be an enumeration of the balls in $\mathcal{P}_i(B)$.
Choose an arbitrary sequence of balls $B_1 \supseteq B_2 \supseteq \ldots \supseteq B_{i-1} = B$ such that
$B_1 \in \mathcal{P}_1$ and for all $j > 1$, $B_j \in \mathcal{P}_j(B_{j-1})$. Similarly, for
$k = 1, 2, \ldots, n$, choose an arbitrary sequence of balls
$B^k = B_i^k \supseteq B_{i+1}^k \supseteq \cdots$ such that $B_j^k \in \mathcal{P}_j(B_{j-1}^k)$ for all $j > i$.

Define functions $f_j$ ($1 \le j \le i-1$) and $f_j^k$ ($j \ge i$) using the balls $B_j, B_j^k$, as in the proof
of Theorem 3.5. Specifically, use the definition of $f_B$ given there and set $f_j = f_{B_j}$ and
$f_j^k = f_{B_j^k}$. Let $\mu^0 = \sum_{j=0}^{i-1} f_j$ and

$$\mu^k = \mu^0 + \sum_{j=i}^{\infty} f_j^k \qquad (\text{for } 1 \le k \le n).$$

Note that the instances $\mu^k$ ($1 \le k \le n$) are equiprobable under our distribution on input instances
$\mu$. The instance $\mu^0$ is not one that could be randomly sampled by our construction, but it is useful as a
"reference measure" in the following proof. Note that the functions $\mu^k$ have the following properties, by
construction.

(a) $1/3 \le \mu^k(x) \le 2/3$ for all $x \in X$.

(b) $0 \le \mu^k(x) - \mu^0(x) \le r$ for all $x \in X$.

(c) If $x \in X \setminus B^k$, then $\mu^k(x) = \mu^0(x)$.

(d) If $x \in X \setminus B^k$, then there exists some point $x^k \in B^k$ such that $\mu^k(x^k) - \mu^k(x) \ge r/2$.

Each of the payoff functions $\mu^k$ ($0 \le k \le n$) gives rise to a probability distribution $P_{\mu^k}$ on
$\{0, 1\}^t$ as described in the preceding section. We will use the shorthand notation $P_k$ instead of
$P_{\mu^k}$. We will also use $E_k$ to denote the expectation of a random variable under distribution $P_k$.
Finally, we let $N_k$ denote the random variable defined on $\{0, 1\}^t$ that counts the number of rounds $s$
($1 \le s \le t$) in which algorithm $\mathcal{A}$ chooses a strategy in $B^k$ given the history $\sigma$.

The following lemma is analogous to Lemma A.l of fSl, and its proof is identical to the proof of that 
lemma. 

Lemma 3.7. Let $f : \{0, 1\}^t \to [0, M]$ be any function defined on reward sequences $\sigma$. Then for any $k$,

$$E_k[f(\sigma)] \le E_0[f(\sigma)] + \frac{M}{2} \sqrt{-\ln(1 - 4r^2)\, E_0[N_k]}.$$

Applying Lemma 3.7 with $f = N_k$ and $M = t$, and averaging over $k$, we may apply exactly the same
reasoning as in the proof of Theorem A.2 of lO to derive the bound 

$$\frac{1}{n} \sum_{k=1}^{n} E_k(N_k) \le \frac{t}{n} + O\Big(t r \sqrt{t/n}\Big). \qquad (9)$$

Recalling that the actual ball $B_i$ sampled when randomly constructing $\mu$ in the proof of Theorem 3.5 is
a uniform random sample from $B^1, B^2, \ldots, B^n$, we may write $N_*$ to denote the random variable which
counts the number of rounds in which the algorithm plays a strategy in $B_i$, and the bound (9) implies

$$E(N_*) \le \frac{t}{n} + O\Big(t r \sqrt{t/n}\Big).$$

Recalling that $t = r^{-a-2}$ and $n = r^{-b}$, we see that the $O(t r \sqrt{t/n})$ term is the dominant term on
the right side, and that it is bounded by $O(t\, r^{(b-a)/2})$. An application of Markov's inequality now yields

$$\Pr(N_* \ge t/2) = O\big(r^{(b-a)/2}\big),$$

completing the proof of Lemma 3.6.
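
The exponent bookkeeping behind this last step can be checked directly. The following worked computation is a sketch that takes $t = r^{-a-2}$ and $n = r^{-b}$ with $0 \le a < b$ and $0 < r < 1$ (the non-negativity of $a$ is an assumption made here for the comparison of the two terms), and writes $N_*$ for the number of rounds spent inside the sampled ball:

```latex
\begin{align*}
\sqrt{t/n} &= \sqrt{r^{-a-2}\, r^{b}} = r^{(b-a)/2 - 1}, \\
t\, r \sqrt{t/n} &= t \cdot r \cdot r^{(b-a)/2 - 1} = t\, r^{(b-a)/2}, \\
\frac{t}{n} &= t\, r^{b} \;\le\; t\, r^{(b-a)/2}
  \quad\text{since } b \ge \tfrac{b-a}{2} \text{ and } 0 < r < 1, \\
\Pr(N_* \ge t/2) &\le \frac{E(N_*)}{t/2} = O\big(r^{(b-a)/2}\big)
  \quad\text{(Markov's inequality)}.
\end{align*}
```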

3.2 Beyond the covering dimension 

Thus far, we have seen that every metric space $X$ has a bandit algorithm $\mathcal{A}$ such that
$\mathrm{DIM}(\mathcal{A}) = \mathrm{COV}(X)$ (the naive algorithm), and we have seen (via the needle-in-haystack
construction, Theorem 3.5) that $X$ can never have a bandit algorithm satisfying
$\mathrm{DIM}(\mathcal{A}) < \mathrm{MaxMinCOV}(X)$. When $\mathrm{COV}(X) \ne \mathrm{MaxMinCOV}(X)$,
which of these two bounds is correct, or can they both be wrong?

To gain intuition, we will consider two concrete examples. Consider an infinite rooted tree where for
each level $i \in \mathbb{N}$ most nodes have out-degree 2, whereas the remaining nodes (called fat nodes) have
out-degree $x > 2$ so that the total number of nodes is $4^i$. In our first example, there is exactly one fat
node on every level and the fat nodes form a path (called the fat leaf). In our second example, there are
exactly $2^i$ fat nodes on every level $i$ and the fat nodes form a binary tree (called the fat subtree). In both
examples, we assign a weight of $2^{-i/d}$ (for some constant $d > 0$) to each level-$i$ node; this weight
encodes the diameter of the set of points contained in the corresponding subtree. An infinite rooted tree
induces a metric space $(L, X)$ where $X$ is the set of all infinite paths from the root, and for $u, v \in X$ we
define $L(u, v)$ to be the weight of the least common ancestor of paths $u$ and $v$. In both examples, the
covering dimension is $2d$, whereas the max-min-covering dimension is only $d$ because the "fat subset" (i.e.
the fat leaf or fat subtree) has covering dimension at most $d$, and every point outside the fat subset has an
open neighborhood of covering dimension $d$.
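
The dimension counts in these examples follow from comparing the number of level-$i$ subtrees with their diameter. The snippet below is a rough numerical sketch, assuming the level-$i$ weight is $2^{-i/d}$ and using the heuristic estimate $\log(\text{number of sets}) / \log(1/\text{diameter})$; the function name `covering_exponent` is hypothetical, and the estimate is only a stand-in for the formal definition of covering dimension.

```python
import math


def covering_exponent(num_sets, diameter):
    """Heuristic dimension estimate: the exponent e with num_sets = diameter**(-e)."""
    return math.log(num_sets) / math.log(1 / diameter)


d = 1.5        # the constant d from the tree examples (illustrative value)
level = 10     # tree level i
diameter = 2 ** (-level / d)

whole_tree = covering_exponent(4 ** level, diameter)    # all 4^i nodes
fat_subtree = covering_exponent(2 ** level, diameter)   # the 2^i fat nodes
```

With these choices the whole tree yields exponent $2d$ and the fat subtree yields $d$, independent of the level.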

In both of the metrics described above, the zooming algorithm (Algorithm 12. 3 1 ) performs poorly when the 
optimum x* is located inside the fat subset S, because it is too burdensome to keep covering the profusion 
of strategies located near x* as the ball containing x* shrinks. An improved algorithm, achieving regret 
exponent d, modifies the zooming algorithm by imposing quotas on the number of active strategies that lie 
outside S. At any given time, some strategies outside S may not be covered; however, it is guaranteed that 
there exists an optimal strategy which eventually becomes covered and remains covered forever afterward. 
Intuitively, if some optimal strategy lies in S then imposing a quota on active strategies outside S does not 
hurt. If no optimal strategy lies in S then all of S gets covered eventually and stays covered thereafter, in 
which case the uncovered part of the strategy set has low covering dimension and (starting after the time 
when S becomes permanently covered) no quota is ever exceeded. 

This use of quotas extends to the following general setting which abstracts the idea of "fat subsets": 

Definition 3.8. Fix a metric space $(L, X)$. A closed subset $S \subset X$ is $d$-fat if $\mathrm{COV}(S) \le d$
and for any open superset $U$ of $S$ we have $\mathrm{COV}(X \setminus U) \le d$. More generally, a $d$-fat
decomposition of depth $k$ is a decreasing sequence
$X = S_0 \supset \ldots \supset S_k \supset S_{k+1} = \emptyset$ of closed subsets such that
$\mathrm{COV}(S_k) \le d$ and $\mathrm{COV}(S_i \setminus U) \le d$
whenever $i \in [k]$ and $U$ is an open superset of $S_{i+1}$.

'Recall that a strategy u is called covered at time t if for some active strategy v we have L{u, v) < rt(v). 





Example 3.9. Let $(L, X)$ be the metric space in either of the two "tree with a fat subset" examples. Then
the corresponding "fat subset" $S$ is $d$-fat. For an example of a fat decomposition of depth $k = 2$, consider
the product metric $(L^*, X \times X)$ defined by

$$L^*((x_1, x_2), (y_1, y_2)) = L(x_1, y_1) + L(x_2, y_2),$$

with a fat decomposition given by $S_1 = (S \times X) \cup (X \times S)$ and $S_2 = S \times S$.

When $X$ is a metric space with a $d^*$-fat decomposition $\mathcal{D}$, the algorithm described earlier can be
modified to achieve regret $O(t^{\gamma})$ for any $\gamma > 1 - 1/(d^* + 2)$, by instituting a separate quota
for each subset $S_i$. The algorithm requires access to a $\mathcal{D}$-covering oracle which for a given $i$ and
a given finite set of open balls (given by the centers and the radii) either reports that the balls cover $S_i$,
or returns some strategy in $S_i$ which is not covered by the balls. No further knowledge of $\mathcal{D}$ or the
metric space is required.
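
The interface of such a covering oracle can be sketched as follows. This is an illustration only: the function name `covering_oracle` is hypothetical, and each subset $S_i$ is replaced by a finite list of sample points, which a real oracle for an infinite metric space would not be able to do.

```python
def covering_oracle(subset_points, balls, dist):
    """Return None if the open balls cover every point of the subset,
    otherwise return some uncovered point.

    balls is a list of (center, radius) pairs describing open balls.
    """
    for x in subset_points:
        if not any(dist(x, center) < radius for center, radius in balls):
            return x
    return None


dist = lambda x, y: abs(x - y)
subset = [0.0, 0.5, 1.0]                      # finite stand-in for some S_i
partial = [(0.0, 0.3), (1.0, 0.3)]            # leaves 0.5 uncovered
full = partial + [(0.5, 0.1)]                 # covers everything
```

The algorithm only ever interacts with the decomposition through calls of this shape, which is why no further knowledge of the sets $S_i$ is needed.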

Theorem 3.10. Consider the Lipschitz MAB problem on a fixed compact metric space with a $d^*$-fat
decomposition $\mathcal{D}$. Then for any $d > d^*$ there is an algorithm $\mathcal{A}_d$ such that
$\mathrm{DIM}(\mathcal{A}_d) \le d$.

Remarks. (1) We can relax the compactness assumption in Theorem 3.10: instead, we can assume that the
completion of the metric space is compact and re-define the sets in the $d$-fat decomposition as subsets of the
completion (possibly disjoint from the strategy set). This corresponds to the "fat leaf" which lies outside the
strategy set. Such an extension requires some minor modifications.

(2) The per-metric guarantee expressed by Theorem 3.10 can be complemented with sharper per-instance
guarantees. First, for every problem instance $\mathcal{I}$ the per-instance regret dimension
$\mathrm{DIM}_{\mathcal{I}}(\mathcal{A})$ is upper-bounded by the zooming dimension of $\mathcal{I}$. Second, if
for some $c > 0$ the $c$-covering dimension of $X$ is finite then for some $\gamma < 1$ and all $t$ we have
$R_{\mathcal{A}}(t) \le O(c\, t^{\gamma})$. However, as this extension is tangential to our main
storyline, we focus on analyzing the regret dimension.

The algorithm. Our algorithm proceeds in phases $i = 1, 2, 3, \ldots$ of $2^i$ rounds each. In a given phase,
we run a fresh instance of the following phase algorithm $\mathcal{A}_{\mathrm{ph}}(T, d, \mathcal{D})$
parameterized by the phase length $T = 2^i$, target dimension $d > d^*$ and the $\mathcal{D}$-covering oracle.
The phase algorithm is a version of a single phase of the zooming algorithm with very different rules for
activating strategies; the confidence radius and the index are defined exactly as in the zooming algorithm. At
the start of each round some strategies are activated, and then an active strategy with the maximal index is
played.

Let us specify the activation rules. Let $k$ be the depth of the decomposition $\mathcal{D}$, and denote
$\mathcal{D} = \{S_i\}_{i=0}^{k}$. Initially the algorithm constructs $2^{-j}$-nets $\mathcal{N}_j$,
$j \in \mathbb{N}$, using the covering oracle. It finds the largest $j$ such that $\mathcal{N} = \mathcal{N}_j$
contains at most $\tfrac12 T^{d/(d+2)}$ points, and activates all strategies in $\mathcal{N}$. The rest of the
active strategies are partitioned into $k + 1$ pools $P_i \subset S_i$ such that at each time $t$ each pool $P_i$
satisfies the following quota (that we denote $Q_i$):

$$|\{u \in P_i : r_t(u) \ge \rho\}| \le C_\rho\, \rho^{-d} \qquad (10)$$

where $\rho = T^{-1/(d+2)}$ and $C_\rho = (64 k \log \tfrac{1}{\rho})^{-1}$. At the beginning of each round the
following activation routine is performed. If there exists a set $S_i$ such that some strategy in $S_i$ is not
covered and there is room under the corresponding quota $Q_i$, pick one such strategy, activate it, and add it
to the corresponding pool $P_i$. Repeat until there are no such sets $S_i$ left. Since for a given strategy $u$
the confidence radius $r_t(u)$ is non-increasing in $t$, the constraint (10) is never violated. This completes
the description of the algorithm.
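
The activation routine with quotas can be sketched in a few lines. This is a simplified illustration under explicit assumptions, not the authors' implementation: the name `activation_round` and all parameter names are hypothetical, each set $S_i$ is a finite list of points, the quota $C_\rho\,\rho^{-d}$ is passed in as a single number `quota`, a strategy counts as covered when it lies within the confidence radius of some active strategy, and a newly activated strategy receives a caller-supplied `initial_radius`.

```python
def activation_round(subsets, pools, active, radius, rho, quota, dist, initial_radius):
    """Activate uncovered strategies while the per-pool quotas leave room.

    subsets[i] is a finite stand-in for S_i, pools[i] is the pool P_i,
    radius maps each active strategy to its current confidence radius.
    """
    def covered(u):
        return any(dist(u, v) <= radius[v] for v in active)

    def has_room(i):
        # Quota (10): count pool members whose radius is still at least rho.
        return sum(1 for u in pools[i] if radius[u] >= rho) < quota

    activated = True
    while activated:  # repeat until no set S_i yields a new activation
        activated = False
        for i, subset in enumerate(subsets):
            if not has_room(i):
                continue
            candidate = next((u for u in subset if not covered(u)), None)
            if candidate is not None:
                radius[candidate] = initial_radius
                active.append(candidate)
                pools[i].append(candidate)
                activated = True
    return active, pools


dist = lambda x, y: abs(x - y)
subsets = [[0.0, 0.25, 0.5, 0.75, 1.0], [0.5]]
active, pools = activation_round(
    subsets, pools=[[], []], active=[], radius={},
    rho=0.1, quota=2, dist=dist, initial_radius=0.2,
)
```

In this toy run the first pool fills its quota of two strategies, so the points 0.75 and 1.0 stay uncovered, which is exactly the situation the analysis has to handle.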

Analysis. As was the case in Section 2, the analysis of the unbounded-time-horizon algorithm reduces to
proving a lemma about the regret of each phase algorithm.

Lemma 3.11. Fix a problem instance in the setting of Theorem 3.10. Let
$\mathcal{A}_{\mathrm{ph}}(T) = \mathcal{A}_{\mathrm{ph}}(T, d, \mathcal{D})$. Then



(11) 



Note that the lemma bounds the regret of $\mathcal{A}_{\mathrm{ph}}(T)$ for time $T \ge t_{\min}$ only. Proving
Theorem 3.10 is now straightforward:

Proof of Theorem 3.10. Let $\mathcal{A}_{\mathrm{ph}}(T)$ be the phase algorithm from Lemma 3.11. Recall that in
each phase $i$ in the overall algorithm $\mathcal{A}$ we simply run a fresh instance of algorithm
$\mathcal{A}_{\mathrm{ph}}(2^i)$ for $2^i$ steps.

Let $t_0$ be the $t_{\min}$ from (11) rounded up to the nearest end-of-phase time. Let $i_0$ be the phase
starting at time $t_0 + 1$. Note that $R_{\mathcal{A}}(t_0) \le t_0$. Let $R_i$ be the regret accumulated by
$\mathcal{A}$ during phase $i$. Let $\gamma = 1 - 1/(d+2)$. Then for any time $t \ge t_0^{1/\gamma}$ in phase $i$
we have $R_{\mathcal{A}}(t) \le t_0 + \sum_{j=i_0}^{i} R_j \le t_0 + \sum_{j=i_0}^{i} (2^j)^{\gamma} \le O(t^{\gamma})$. □
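
The final inequality in this proof is a geometric-sum estimate. As a sketch, assuming each phase $j \ge i_0$ contributes regret at most $(2^j)^{\gamma}$ for a fixed exponent $\gamma \in (0,1)$, and using that any time $t$ in phase $i$ satisfies $2^i \le 2t$ (phases $1, \ldots, i-1$ take $2^i - 2$ rounds in total):

```latex
\begin{align*}
\sum_{j=i_0}^{i} (2^j)^{\gamma}
  \;\le\; \sum_{j=0}^{i} 2^{\gamma j}
  \;=\; \frac{2^{\gamma (i+1)} - 1}{2^{\gamma} - 1}
  \;\le\; \frac{2^{\gamma}}{2^{\gamma} - 1}\, (2^i)^{\gamma}
  \;\le\; \frac{2^{\gamma}}{2^{\gamma} - 1}\, (2t)^{\gamma}
  \;=\; O(t^{\gamma}).
\end{align*}
```

The additive term $t_0$ is also at most $t^{\gamma}$ once $t \ge t_0^{1/\gamma}$.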

In the remainder of this section we prove Lemma 3.11. Let us fix a problem instance of the Lipschitz
MAB problem on a compact metric space $(L, X)$ with a depth-$k$ $d^*$-fat decomposition
$\mathcal{D} = \{S_i\}_{i=0}^{k}$. Fix $d > d^*$ and let
$\mathcal{A}_{\mathrm{ph}}(T) = \mathcal{A}_{\mathrm{ph}}(T, d, \mathcal{D})$ be the phase algorithm. Let $\mu$
be the expected reward function and let $\mu^* = \sup_{u \in X} \mu(u)$ be the optimal reward. Let
$\Delta(u) = \mu^* - \mu(u)$.

By definition of the Lipschitz MAB problem, $\mu$ is a continuous function on the metric space $(L, X)$.
Therefore the supremum $\mu^*$ is achieved by some strategy (call such strategies optimal). Say that a run of
algorithm $\mathcal{A}_{\mathrm{ph}}(T)$ is well-covered if at every time $t \le T$ some optimal strategy is covered.

Say that a run of algorithm AptiT) is clean if the property in Claim [Z2] holds for all times t <T. Note 
that a given run is clean with probability at least 1 — T^^. The following lemma adapts the technique from 
Lemma 1231 to the present setting: 

Claim 3.12. Consider a clean run of algorithm $\mathcal{A}_{\mathrm{ph}}(T)$.

(a) If strategies $u, v$ are active at time $t \le T$ then $\Delta(v) - \Delta(u) \le 4 r_t(v)$.

(b) If the run is well-covered and strategy $v$ is active at time $t \le T$ then $\Delta(v) \le 4 r_t(v)$.

The quotas (flOl ) are chosen so that the regret computation in Claim 1277] works out for a clean and well- 
covered run of algorithm .4pii(T). 

Claim 3.13. $R_{\mathcal{A}}(T) \le T^{1 - 1/(d+2)}$ for any clean well-covered run of algorithm
$\mathcal{A} = \mathcal{A}_{\mathrm{ph}}(T)$.

Sketch. Let $A_t(\delta)$ be the set of all strategies $u \in X$ such that $u$ is active at time $t \le T$ and
$\delta \le r_t(u) < 2\delta$. Note that for any such strategy we have $n_t(u) \le O(\log T)\, \delta^{-2}$ and
$\Delta(u) \le 4 r_t(u) < 8\delta$. Write the regret as a sum over the sets $A_T(2^i \rho)$,
$i = 0, 1, \ldots, \lceil \log(1/\rho) \rceil$, where $\rho = T^{-1/(d+2)}$, and apply the quotas (10). □

Let $S_\ell$ be the smallest set in $\mathcal{D}$ which contains some optimal strategy. Then there is an optimal
strategy contained in $S_\ell \setminus S_{\ell+1}$; let $u^*$ be one such strategy. The following claim
essentially shows that the irrelevant high-dimensional subset is eventually pruned away.

Claim 3.14. There exists an open set $U$ containing $S_{\ell+1}$ such that $u^* \notin U$ and $U$ is always
covered throughout the first $T$ steps of any clean run of algorithm $\mathcal{A}_{\mathrm{ph}}(T)$, provided
that $T$ is sufficiently large.

Proof. $S_{\ell+1}$ is a compact set since it is a closed subset of a compact metric space. Since the function
$\mu$ is continuous, it assumes a maximum value on $S_{\ell+1}$. By construction, this maximum value is strictly
less than $\mu^*$. So there exists $\epsilon > 0$ such that $\Delta(w) > 8\epsilon$ for any $w \in S_{\ell+1}$.
Define $U = B(S_{\ell+1}, \epsilon/2)$. Note that $u^* \notin U$ since
$8\epsilon < \Delta(w) \le L(u^*, w)$ for any $w \in S_{\ell+1}$.

Recall that in the beginning of algorithm $\mathcal{A}_{\mathrm{ph}}(T)$ all strategies in some $2^{-j}$-net
$\mathcal{N}$ are activated. Suppose $T$ is large enough so that $2^{-j} \le \epsilon$.

Consider a clean run of algorithm $\mathcal{A}_{\mathrm{ph}}(T)$. We claim that $U$ is covered at any given time
$t \le T$. Indeed, fix $u \in U$. By definition of $U$ there exists a strategy $w \in S_{\ell+1}$ such that
$L(u, w) < \epsilon/2$. By definition of $\mathcal{N}$ there exist $v, v^* \in \mathcal{N}$ such that
$L(v, w) \le \epsilon$ and $L(u^*, v^*) \le \epsilon$. Note that:

(a) $\Delta(v^*) = \mu(u^*) - \mu(v^*) \le L(u^*, v^*) \le \epsilon$.

(b) Since $L(v, w) \le \epsilon$ and $\Delta(w) > 8\epsilon$, we have $\Delta(v) \ge 7\epsilon$.

(c) By Claim 3.12 we have $\Delta(v) - \Delta(v^*) \le 4 r_t(v)$.

Combining (a-c), it follows that $4 r_t(v) \ge 7\epsilon - \epsilon$, i.e.
$r_t(v) \ge \tfrac32 \epsilon > L(u, w) + L(w, v) \ge L(u, v)$, so $v$ covers $u$. Claim proved. □

Proof of Lemma 3.11. By Claim 3.13 it suffices to show that if $T$ is sufficiently large then any clean run of
algorithm $\mathcal{A}_{\mathrm{ph}}(T)$ is well-covered. (Runs that are not clean contribute only $O(1/T)$ to
the expected regret of $\mathcal{A}_{\mathrm{ph}}(T)$, because the probability that a run is not clean is at most
$T^{-2}$ and the regret of such a run is at most $T$.) Specifically, we will show that $u^*$ is covered at any
time $t \le T$ during a clean run of $\mathcal{A}_{\mathrm{ph}}(T)$. It suffices to show that at any time
$t \le T$ there is room under the corresponding quota $Q_\ell$ in (10).

Let $U$ be the open set from Claim 3.14. Since $U$ is an open neighborhood of $S_{\ell+1}$, by definition of the
fat decomposition it follows that $\mathrm{COV}(S_\ell \setminus U) \le d^*$. Define $\rho$ and $C_\rho$ as in
(10) and fix $d' \in (d^*, d)$. Then for any sufficiently large $T$ it is the case that (i)
$S_\ell \setminus U$ can be covered with $(1/\rho)^{d'}$ sets of diameter less than $\rho$ and moreover (ii)
that $(1/\rho)^{d'} \le \tfrac12 C_\rho\, \rho^{-d}$.

Fix time $t \le T$ and let $A_t$ be the set of all strategies $u$ such that $u$ is in the pool $P_\ell$ at time
$t$ and $r_t(u) \ge \rho$. Note that $A_t \subset S_\ell \setminus U$ since $U$ is always covered, and by the
specification of $\mathcal{A}_{\mathrm{ph}}$ only uncovered strategies in $S_\ell$ are activated and added to
pool $P_\ell$. Moreover, $A_t$ is $\rho$-separated. (Indeed, let $u, v \in A_t$ and assume $u$ has been activated
before $v$. Then $L(u, v) > r_s(u) \ge r_t(u) \ge \rho$, where $s$ is the time when $v$ was activated.) Each of
the covering sets in (i) therefore contains at most one element of $A_t$, so
$|A_t| \le \tfrac12 C_\rho\, \rho^{-d}$ and there is room under the corresponding quota $Q_\ell$ in (10). □

3.3 The per-metric optimal algorithm 

The algorithm in Theorem 3.10 requires a fat decomposition of finite depth, which in general might not exist.
To extend the ideas of the preceding section to arbitrary metric spaces, we must generalize Definition 3.8 to
transfinite depth.

Definition 3.15. Fix a metric space $(L, X)$. Let $\beta$ denote an arbitrary ordinal. A transfinite $d$-fat
decomposition of depth $\beta$ is a transfinite sequence $\{S_\lambda\}_{0 \le \lambda \le \beta}$ of closed
subsets of $X$ such that:

(a) $S_0 = X$, $S_\beta = \emptyset$, and $S_\nu \supseteq S_\lambda$ whenever $\nu < \lambda$.

(b) if $V \subset X$ is closed, then the set $\{\text{ordinals } \nu \le \beta : V \text{ intersects } S_\nu\}$
has a maximum element.

(c) for any ordinal $\lambda \le \beta$ and any open set $U \subset X$ containing $S_{\lambda+1}$ we have
$\mathrm{COV}(S_\lambda \setminus U) \le d$.

Note that for a finite depth $\beta$ the above definition is equivalent to Definition 3.8. In Theorem 3.17 below,
we will show how to modify the "quota algorithms" from the previous section to achieve regret dimension $d$
in any metric with a transfinite $d^*$-fat decomposition for $d^* < d$. This gives an optimal algorithm for every
metric space $X$ because of the following surprising relation between the max-min-covering dimension and
transfinite fat decompositions.

Proposition 3.16. For every compact metric space (L, X), the max-min-covering dimension of X is equal 
to the infimum of all d such that X has a transfinite d-fat decomposition. 

Proof. If $\emptyset \ne Y \subset X$ and $\mathrm{MinCOV}(Y) > d$ then, by transfinite induction,
$Y \subseteq S_\lambda$ for all $\lambda$ in any transfinite $d$-fat decomposition, contradicting the fact that
$S_\beta = \emptyset$. Thus, the existence of a transfinite $d$-fat decomposition of $X$ implies
$d \ge \mathrm{MaxMinCOV}(X)$. To complete the proof we will construct, given any
$d > \mathrm{MaxMinCOV}(X)$, a transfinite $d$-fat decomposition of depth $\beta$, where $\beta$ is any ordinal
whose cardinality exceeds that of $X$. For a metric space $Y$, define the set of $d$-thin points
$\mathrm{TP}(Y, d)$ to be the union of all open sets $U \subset Y$ satisfying $\mathrm{COV}(U) < d$. Its
complement, the set of $d$-fat points, is denoted by $\mathrm{FP}(Y, d)$. Note that it is a closed subset of $Y$.

For an ordinal $\lambda \le \beta$, we define a set $S_\lambda$ using transfinite induction as follows:

1. $S_0 = X$ and $S_{\lambda+1} = \mathrm{FP}(S_\lambda, d)$ for each ordinal $\lambda$.

2. If $\lambda$ is a limit ordinal then $S_\lambda = \bigcap_{\nu < \lambda} S_\nu$.

Note that each $S_\lambda$ is closed, by transfinite induction. It remains to show that
$\mathcal{D} = \{S_\lambda\}_{0 \le \lambda \le \beta}$ satisfies the properties (a-c) in Definition 3.15. It
follows immediately from the construction that $S_0 = X$ and $S_\nu \supseteq S_\lambda$ when $\nu < \lambda$.
To prove that $S_\beta = \emptyset$, observe first that the sets $S_\lambda \setminus S_{\lambda+1}$ (for
$0 \le \lambda < \beta$) are disjoint subsets of $X$, and the number of such sets is greater than the cardinality
of $X$, so at least one of them is empty. This means that $S_\lambda = S_{\lambda+1}$ for some $\lambda < \beta$.
If $S_\lambda = \emptyset$ then $S_\beta = \emptyset$ as desired. Otherwise, the relation
$\mathrm{FP}(S_\lambda, d) = S_\lambda$ implies that $\mathrm{MinCOV}(S_\lambda) \ge d$, contradicting the
assumption that $\mathrm{MaxMinCOV}(X) < d$. This completes the proof of property (a). To prove property (b),
suppose $\{\nu_i \mid i \in I\}$ is a set of ordinals such that $S_{\nu_i}$ intersects $V$ for every $i$. Let
$\nu = \sup\{\nu_i\}$. Then $S_\nu \cap V = \bigcap_{i \in I} (S_{\nu_i} \cap V)$ and the latter set is nonempty
because $X$ is compact and the closed sets $\{S_{\nu_i} \cap V \mid i \in I\}$ have the finite intersection
property. Finally, to prove property (c), note that if $U$ is an open neighborhood of $S_{\lambda+1}$ then the
set $T = S_\lambda \setminus U$ is closed (hence compact) and is contained in $\mathrm{TP}(S_\lambda, d)$.
Consequently $T$ can be covered by open sets $V$ satisfying $\mathrm{COV}(V) < d$. By compactness of $T$, this
covering has a finite subcover $V_1, \ldots, V_m$, and consequently
$\mathrm{COV}(T) = \max_{1 \le j \le m} \mathrm{COV}(V_j) < d$. □

Theorem 3.17. Consider the Lipschitz MAB problem on a compact metric space $(L, X)$. For any
$d > \mathrm{MaxMinCOV}(X)$ there exists an algorithm $\mathcal{A}_d$ such that $\mathrm{DIM}(\mathcal{A}_d) \le d$.

Note that Theorem 1 1 .21 follows immediately by combining Theorem 13 . 1 7 1 with Theorem [33] 
We next describe an algorithm $\mathcal{A}_d$ satisfying Theorem 3.17. The algorithm requires two oracles: a
depth oracle Depth(·) and a $\mathcal{D}$-covering oracle $\mathcal{D}$-Cov(·). For any finite set of open balls
$B_0, B_1, \ldots, B_n$ (given via the centers and the radii) whose union is denoted by $B$,
Depth$(B_0, B_1, \ldots, B_n)$ returns the maximum ordinal $\lambda$ such that $S_\lambda$ intersects the closure
$\overline{B}$; such an ordinal exists by Definition 3.15(b).¹ Given a finite set of open balls
$B_0, B_1, \ldots, B_n$ with union $B$ as above, and an ordinal $\lambda$,
$\mathcal{D}$-Cov$(\lambda, B_0, B_1, \ldots, B_n)$ either reports that $B$ covers $S_\lambda$, or it returns a
strategy $x \in S_\lambda \setminus B$.

The algorithm. Our algorithm proceeds in phases $i = 1, 2, 3, \ldots$ of $2^i$ rounds each. In any given phase
$i$, there is a "target ordinal" $\lambda(i)$ (defined at the end of the preceding phase), and we run an
algorithm during the phase which: (i) activates some nodes initially; (ii) plays a version of the zooming
algorithm which only activates strategies in $S_{\lambda(i)}$; (iii) concludes the phase by computing
$\lambda(i+1)$. The details are as follows. In a given phase we run a fresh instance of a phase algorithm
$\mathcal{A}_{\mathrm{ph}}(T, d, \lambda)$ where $T = 2^i$ and $\lambda = \lambda(i)$ is a target ordinal for
phase $i$, defined below when we give the full description of $\mathcal{A}_{\mathrm{ph}}(T, d, \lambda)$. The
goal of $\mathcal{A}_{\mathrm{ph}}(T, d, \lambda)$ is to satisfy the per-phase bound

$$R_{\mathcal{A}_{\mathrm{ph}}(T, d, \lambda)}(T) = O(T^{\gamma}) \qquad (12)$$

for all $T \ge T_0$, where $\gamma = 1 - 1/(d+2)$ and $T_0$ is a number which may depend on the instance $\mu$.
Then, to derive the bound $R_{\mathcal{A}_d}(t) = O(t^{\gamma})$ for all $t$ we simply sum per-phase bounds over
all phases ending before time $2t$.

¹To avoid the question of how arbitrary ordinals are represented on the oracle's output tape, we can instead say
that the oracle outputs a point $u \in S_\lambda$ instead of outputting $\lambda$. In this case, the definition of
$\mathcal{D}$-Cov should be modified so that its first argument is a point of $S_\lambda$ rather than $\lambda$
itself.

Initially $\mathcal{A}_{\mathrm{ph}}(T, d, \lambda)$ uses the covering oracle to construct $2^{-j}$-nets
$\mathcal{N}_j$, $j = 0, 1, 2, \ldots$, until it finds the largest $j$ such that $\mathcal{N} = \mathcal{N}_j$
contains at most $\tfrac12 T^{d/(d+2)} \log(T)$ points. It activates all strategies in $\mathcal{N}$ and sets

$$\varepsilon(i) = \max\{2^{-j},\ 32\, T^{-1/(d+2)} \log(T)\}.$$

After this initialization step, for every active strategy $v$ we define the confidence radius

where $n_t(v)$ is the number of times $v$ has been played by the phase algorithm
$\mathcal{A}_{\mathrm{ph}}(T, d, \lambda)$ before time $t$. Let $B_0, B_1, \ldots, B_n$ be an enumeration of the
open balls belonging to the collection

$$\{B(v, r_t(v)) \mid v \text{ active at time } t\}.$$

If $n < \tfrac12 T^{d/(d+2)} \log(T)$ then we perform the oracle call
$\mathcal{D}$-Cov$(\lambda, B_0, \ldots, B_n)$, and if it reports that a point $x \in S_\lambda$ is uncovered, we
activate $x$ and set $n_t(x) = 0$. The index of an active strategy $v$ is defined as $\mu_t(v) + 4 r_t(v)$ (note
the slight difference from the index used by the zooming algorithm), and we always play the active strategy with
maximum index. To complete the description of the algorithm, it remains to explain how the ordinals
$\lambda(i)$ are defined. The definition is recursive, beginning with $\lambda(1) = 0$. At the end of phase $i$
($i \ge 1$), we let $B_0, B_1, \ldots, B_m$ be an enumeration of the open balls in the set
$\{B(v, \varepsilon(i)) \mid v \text{ active},\ r_T(v) < \varepsilon(i)/2\}$. Finally, we set
$\lambda(i+1) = \mathrm{Depth}(B_0, B_1, \ldots, B_m)$.
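
The selection rule based on this modified index is easy to state in code. The snippet below is a minimal sketch with hypothetical names (`select_strategy`, `avg_reward`, `conf_radius`); the confidence radii are taken as given inputs rather than computed, and ties are broken by dictionary order, which the text does not specify.

```python
def select_strategy(avg_reward, conf_radius):
    """Return the active strategy maximizing the index avg_reward + 4 * conf_radius.

    Both arguments are dicts keyed by the currently active strategies.
    """
    return max(avg_reward, key=lambda v: avg_reward[v] + 4 * conf_radius[v])


# A rarely played strategy with a wide radius can outrank a better-looking one.
avg_reward = {"u": 0.6, "v": 0.5}
conf_radius = {"u": 0.05, "v": 0.1}
chosen = select_strategy(avg_reward, conf_radius)
```

Here strategy "u" has index 0.8 and strategy "v" has index 0.9, so "v" is played despite its lower average reward.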

Proof of Theorem 3.17. Since we have modified the definition of index, we must prove a variant of Claim 3.12(a)
which asserts the following:

In a clean run of $\mathcal{A}_{\mathrm{ph}}$, if $u, v$ are active at time $t$ then
$\Delta(v) - \Delta(u) \le 5 r_t(v)$. (13)

To prove it, let $s$ be the latest round in $\{1, 2, \ldots, t\}$ when $v$ was played. We have
$r_t(v) = r_s(v)$, and $\Delta(v) - \Delta(u) = \mu(u) - \mu(v)$, so it remains to prove that

$$\mu(u) - \mu(v) \le 5 r_s(v). \qquad (14)$$

From the fact that $v$ was played instead of $u$ at time $s$, together with the fact that the run is clean, we
have

$$\mu_s(u) + 4 r_s(u) \le \mu_s(v) + 4 r_s(v) \qquad (15)$$
$$\mu(u) - \mu_s(u) \le r_s(u) \qquad (16)$$
$$\mu_s(v) - \mu(v) \le r_s(v). \qquad (17)$$

We obtain (14) by adding (15)-(17), noting that $r_s(u) \ge 0$. This completes the proof of (13).
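
For readers who want the addition spelled out, the following worked derivation is a sketch using the three inequalities just listed (the index comparison at time $s$ and the two clean-run confidence bounds), with $\mu_s$ denoting the average reward at time $s$:

```latex
\begin{align*}
\mu(u) - \mu(v)
  &= \big(\mu(u) - \mu_s(u)\big) + \big(\mu_s(u) - \mu_s(v)\big) + \big(\mu_s(v) - \mu(v)\big) \\
  &\le r_s(u) + \big(4 r_s(v) - 4 r_s(u)\big) + r_s(v) \\
  &= 5 r_s(v) - 3 r_s(u) \;\le\; 5 r_s(v).
\end{align*}
```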

Let $\lambda$ be the maximum ordinal such that $S_\lambda$ contains an optimal strategy $u^*$; such an ordinal
exists by Definition 3.15(b). We will prove that for sufficiently large $i$, if the $i$-th phase is clean, then
$\lambda(i+1) = \lambda$. The set $S_{\lambda+1}$ is compact, and the function $\mu$ is continuous, so it assumes
a maximum value on $S_{\lambda+1}$ which is, by construction, strictly less than $\mu^*$. Choose $\epsilon > 0$
such that $\Delta(w) > 5\epsilon$ for all $w \in S_{\lambda+1}$, and choose $T_0 = 2^{i_0}$ such that
$\varepsilon(i_0) \le \epsilon$. We shall prove that for all $T = 2^i \ge T_0$ and all ordinals $\nu$, a clean
run of $\mathcal{A}_{\mathrm{ph}}(T, d, \nu)$ results in setting $\lambda(i+1) = \lambda$. First, let
$v^* \in \mathcal{N}$ be such that $L(u^*, v^*) \le \varepsilon(i)$. If $v$ is active and
$r_T(v) < \varepsilon(i)/2$ then (13) implies that $\Delta(v) - \Delta(v^*) < \tfrac52 \varepsilon(i)$, hence
$\Delta(v) < \tfrac72 \varepsilon(i)$. As $\Delta(w) > 5\epsilon \ge 5\varepsilon(i)$ for all
$w \in S_{\lambda+1}$, it follows that the closure of $B(v, \varepsilon(i))$ does not intersect $S_{\lambda+1}$.
This guarantees that Depth$(B_0, B_1, \ldots, B_m)$ returns an ordinal less than or equal to $\lambda$. Next we
must prove that this ordinal is greater than or equal to $\lambda$. Note that the total number of strategies
activated by $\mathcal{A}_{\mathrm{ph}}(T, d, \nu)$ is bounded above by $T^{d/(d+2)} \log(T)$. Let $A_T$ denote
the set of strategies active at time $T$ and let

$$v^{\max} = \arg\max_{v \in A_T} n_T(v).$$

By the pigeonhole principle, $n_T(v^{\max}) \ge T^{2/(d+2)} / \log(T)$ and hence
$r_T(v^{\max}) < 3\, T^{-1/(d+2)} \log(T)$. If $t$ denotes the last time at which $v^{\max}$ was played, then we
have

$$I_t(v^{\max}) \le \mu^* + 15\, T^{-1/(d+2)} \log(T) \le \mu^* + \varepsilon(i)/2,$$

provided that the phase is clean and that $T \ge T_0$. Since $v^{\max}$ had maximum index at time $t$, we deduce
that $I_t(v^*) \le \mu^* + \varepsilon(i)/2$ as well. As $L(u^*, v^*) \le \varepsilon(i)$ we have
$\mu_t(v^*) \ge \mu^* - \varepsilon(i) - r_t(v^*)$ provided the phase is clean. To finish the proof we observe
that

$$\mu^* + \varepsilon(i)/2 \ge I_t(v^*) \ge \mu^* - \varepsilon(i) + 3 r_t(v^*)$$

which implies $r_t(v^*) \le \varepsilon(i)/2$. Since the confidence radius does not increase over time, we have
$r_T(v^*) \le \varepsilon(i)/2$ so $B(v^*, \varepsilon(i))$ is one of the balls $B_0, B_1, \ldots, B_m$. Since
$u^*$ is contained in the closure of this ball, we may conclude that Depth$(B_0, B_1, \ldots, B_m)$ returns the
ordinal $\lambda$ as desired.

Let $U = B(S_{\lambda+1}, \varepsilon(i)/2)$. As in Claim 3.14, it holds that in any clean phase, $U$ is covered
throughout the phase by balls centered at points of $\mathcal{N}$. Hence for any pair of consecutive clean
phases, in the second phase of the pair our algorithm only calls the covering oracle $\mathcal{D}$-Cov with the
proper ordinal $\lambda$ (i.e. the maximum $\lambda$ such that $S_\lambda$ contains an optimal strategy) and with
a set of balls $B_0, B_1, \ldots, B_n$ that covers $U$. Also, note that an active strategy $v$ during a run of
$\mathcal{A}_{\mathrm{ph}}(T, d, \lambda)$ never has a confidence radius $r_t(v)$ less than
$\delta = T^{-1/(d+2)}$, so the strategies activated by the covering oracle form a $\delta$-net in the space
$S_\lambda \setminus U$. By Definition 3.15(c), a $\delta$-net in $S_\lambda \setminus U$ contains fewer than
$O(\delta^{-d})$ points. Hence for sufficiently large $T$ the "quota" of $\tfrac12 T^{d/(d+2)} \log(T)$ active
strategies is never reached, which implies that every point of $S_\lambda$, including $u^*$, is covered
throughout the phase. The upper bound on the regret of $\mathcal{A}_{\mathrm{ph}}(T, d, \lambda)$ concludes as in the
proof of Theorem [23] □ 

4 Zooming algorithm: extensions and examples 

We extend the analysis in Section 2 in several directions, and follow up with examples.

• In Section |4~T] we note that our analysis works under a more abstract notion of the confidence radius: 
essentially, it can be any function of the history of playing a given strategy such that Claim [Z2l holds. 
This observation leads to sharper results if the reward from playing each strategy u is ^(n) plus an 
independent noise of a known and "benign" shape; we provide several concrete examples. 

• In Section 14.21 we provide an improved version of the confidence radius such that the zooming algo- 
rithm satisfies the guarantee in Theorem [23] and achieves a better regret exponent -^-^ if the maximal 
reward is exactly 1 . The analysis builds on a novel Chemoff-style bound which, to the best of our 
knowledge, has not appeared in the literature. 

• In Section 4.3 we consider an example which showcases both the notion of the zooming dimension
and the improved algorithm from Section 4.2. It is the target MAB problem, a version of the
Lipschitz MAB problem in which the expected reward of a given strategy is equal to its distance
to some (unknown) target set $S$. We show that the zooming algorithm performs much better in this
setting; in particular, if the metric is doubling and $S$ is finite, it achieves poly-logarithmic regret.
• In Section 4.4 we relax some of the assumptions in the Lipschitz MAB problem: we do not require
the similarity function $L$ to satisfy the triangle inequality, and we need the Lipschitz condition (1) to
hold only if one of the two strategies is optimal. We use this extension to analyze a generalization of
the target MAB problem in which $\mu(u) = f(L(u, S))$ for some known function $f$.

• Finally, in Section 4.5 we extend the analysis in Section 2 from reward distributions with bounded
support^9 to arbitrary reward distributions with a finite absolute third moment. Our analysis relies on
the extension of the Azuma inequality known as the non-uniform Berry-Esseen theorem [21].

Let us recap some conventions we'll be using throughout this section. The zooming algorithm proceeds
in phases $i = 1, 2, 3, \ldots$ of $2^i$ rounds each. Within a given phase, for each strategy $v \in X$ and time $t$, $n_t(v)$
is the number of times $v$ has been played before time $t$, and $\mu_t(v)$ is the corresponding average reward. Also,
we denote $\Delta(v) = \mu^* - \mu(v)$, where $\mu^* = \sup_{u \in X} \mu(u)$ is the maximal reward.

4.1 Abstract confidence radius and noisy rewards 

In Section 2 the confidence radius of a given strategy was defined by an explicit formula. Here we generalize
this definition to any function of the history of playing this strategy that satisfies certain properties.

Definition 4.1. Consider a single phase $i_{ph}$ of the algorithm. For each strategy $v$ and any time $t$ within this
phase, let $r_t(v)$ and $\mu_t(v)$ be non-negative functions of $i_{ph}$, $t$, and the history of playing $v$ up to round $t$.
Call $r_t(v)$ a confidence radius with respect to $\mu_t(v)$ if

(i) $|\mu_t(v) - \mu(v)| \le r_t(v)$ with probability at least $1 - 8^{-i_{ph}}$.

(ii) $\tfrac{1}{2}\, r_t(v) \le r_{t+1}(v) \le r_t(v)$.

The confidence radius is $(\beta, C)$-good if $n_t(v) \le (C\, i_{ph})\, \Delta^{-\beta}(v)$ whenever $\Delta(v) \le 4\, r_t(v)$.

Remark. Property (i) says that Claim 2.2 holds for the appropriately redefined clean phase. Property (ii) is
a "smoothness" condition: $r_t(v)$ does not increase with time, and does not decrease too fast. It is needed for
the last line of the proof of Lemma [231
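As an illustration of Definition 4.1 (not part of the original analysis), the following Python sketch checks the goodness condition numerically for a hypothetical radius of the form $(i_{ph}/n_t(v))^{1/\beta}$. The function names, the constant $C = 4^\beta$, and the grid of test values are assumptions made only for this example.

```python
def radius(i_ph: int, n: int, beta: float) -> float:
    """Hypothetical confidence radius (i_ph / n)^(1/beta); assumes n >= 1."""
    return (i_ph / n) ** (1.0 / beta)


def is_good(i_ph: int, n: int, delta: float, beta: float, C: float) -> bool:
    """Check the (beta, C)-good condition for one triple (i_ph, n, delta).

    The condition only constrains strategies with delta <= 4 * radius.
    """
    if delta > 4.0 * radius(i_ph, n, beta):
        return True  # the condition is vacuous for this strategy
    return n <= C * i_ph * delta ** (-beta)


def check_grid(beta: float) -> bool:
    """Verify the condition with C = 4^beta on a small grid of values."""
    C = 4.0 ** beta * (1 + 1e-9)  # tiny slack for floating-point rounding
    deltas = [k / 100.0 for k in range(1, 101)]
    return all(
        is_good(i_ph, n, delta, beta, C)
        for i_ph in range(1, 8)
        for n in range(1, 200)
        for delta in deltas
    )
```

For this particular radius, the premise $\Delta \le 4 (i_{ph}/n)^{1/\beta}$ rearranges to $n \le 4^\beta\, i_{ph}\, \Delta^{-\beta}$, so `check_grid(beta)` is expected to hold for any `beta > 0`.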

Given such a confidence radius, we can carry out the proof of Theorem 2.4 with very minor modifications.

Theorem 4.2. Consider an instance of the standard Lipschitz MAB problem for which there exists a
$(\beta, c_0)$-good confidence radius, $\beta > 0$. Let $A$ be an instance of Algorithm 2.3 defined with respect to this
confidence radius. Suppose the problem instance has $c$-zooming dimension $d$. Then:

(a) If $d + \beta > 1$ then $R_A(t) \le a(t)\, t^{1 - 1/(d+\beta)}$ for all $t$, where $a(t) = O(c\, c_0 \log^2 t)^{1/(d+\beta)}$.

(b) If $d + \beta \le 1$ then $R_A(t) \le O(c\, c_0 \log^2 t)$.

Remark. A new feature of this theorem (as compared to Theorem 2.4) is the poly-logarithmic bound on
regret in part (b). For better intuition on this, note that the exponent in part (a) becomes negative if $d + \beta < 1$.
Since the regret bound should not be decreasing in $t$, one would expect this term to vanish from the "correct"
bound. Indeed, it is easy to check that the computation in the proof of Claim 2.7 results in part (b).
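To make the two regimes of Theorem 4.2 concrete, here is a small Python sketch that returns the exponent of $t$ in the regret bound, ignoring the poly-logarithmic factor $a(t)$. The function name and the convention of returning 0 for the poly-logarithmic regime are assumptions of this example, not notation from the paper.

```python
def regret_exponent(d: float, beta: float) -> float:
    """Exponent of t in the regret bound, ignoring poly-logarithmic factors."""
    if d + beta <= 1.0:
        return 0.0  # part (b): regret is poly-logarithmic in t
    return 1.0 - 1.0 / (d + beta)  # part (a)


# A few sample values: (zooming dimension d, goodness parameter beta).
examples = {
    (1.0, 2.0): regret_exponent(1.0, 2.0),  # standard radius on a 1-dimensional instance
    (1.0, 1.0): regret_exponent(1.0, 1.0),  # a 1-good radius on the same instance
    (0.0, 1.0): regret_exponent(0.0, 1.0),  # d + beta <= 1: poly-logarithmic regret
}
```

A smaller $\beta$ therefore helps twice: it lowers the exponent in part (a), and for small enough $d$ it moves the instance into part (b).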

A natural application of Theorem 4.2 is a setting in which the reward from playing each strategy $u$ is
$\mu(u)$ plus an independent noise of known shape.

^9 In Section 4.1 we also consider stochastically bounded distributions such as Gaussians.
Definition 4.3. The Noisy Lipschitz MAB problem is a standard Lipschitz MAB problem such that every
time any strategy $u$ is played, the reward is $\mu(u)$ plus an independent random sample from some fixed
distribution $\mathcal{P}$ (called the noise distribution) which is revealed to the algorithm.

We present several examples in which we take advantage of a "benign" shape of $\mathcal{P}$. Interestingly, in
these examples the payoff distributions are not restricted to have bounded support.^10 Technically the results
are simple corollaries of Theorem 4.2.

We start with perhaps the most natural example when the noise distribution is normal. 

Corollary 4.4. Consider the Noisy Lipschitz MAB problem with normal noise distribution $\mathcal{P} = \mathcal{N}(0, \sigma^2)$.
Then there exists an algorithm $A$ which enjoys guarantee (0) with the right-hand side multiplied by $\sigma$.

Proof. Define the confidence radius as in Section 2, with the right-hand side multiplied by $\sigma$. It is easy to see that this
is a $(2, O(\sigma))$-good confidence radius. The result follows from Theorem 4.2(a). □

Remark. In fact, Corollary 4.4 can be extended to noise distributions of a somewhat more general form: let
us say that a random variable $X$ is stochastically $(\rho, \sigma)$-bounded if its moment-generating function satisfies

$$E\left[e^{r(X - E[X])}\right] \le e^{r^2\sigma^2/2} \quad \text{for all } r \in [-\rho, \rho]. \qquad (18)$$

Note that a normal distribution $\mathcal{N}(0, \sigma^2)$ is $(\infty, \sigma)$-bounded, and any distribution with support $[-\sigma, \sigma]$ is
$(1, \sigma)$-bounded. The meaning of (18) is that it is precisely the condition needed to establish an Azuma-type
inequality: if $S$ is the sum of $n$ independent stochastically $(\rho, \sigma)$-bounded random variables with zero mean,
then with high probability $S \le O(\sigma\sqrt{n})$:

$$\Pr[S > \lambda\sigma\sqrt{n}] \le \exp(-\lambda^2/2) \quad \text{for any } \lambda \le \tfrac{1}{2}\rho\sigma\sqrt{n}. \qquad (19)$$

The derivation and the theorem statement need to be modified slightly to account for the parameter $\rho$; we
omit the details from this version.
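For completeness, here is a sketch of the standard exponential-moment argument that leads from (18) to (19). It assumes only that $S$ is a sum of $n$ independent zero-mean random variables, each satisfying (18); the particular choice of $r$ is the usual optimizing one and is an assumption of this sketch rather than a step quoted from the paper.

```latex
\begin{align*}
\Pr\left[S > \lambda\sigma\sqrt{n}\right]
  &\le e^{-r\lambda\sigma\sqrt{n}}\; E\left[e^{rS}\right]
     && \text{(Markov's inequality, any } r \in (0, \rho]\text{)} \\
  &\le \exp\left(\tfrac{1}{2}\, n r^2 \sigma^2 - r\lambda\sigma\sqrt{n}\right)
     && \text{(independence and (18))} \\
  &= \exp\left(-\lambda^2/2\right)
     && \text{for } r = \lambda / (\sigma\sqrt{n}).
\end{align*}
```

The choice $r = \lambda/(\sigma\sqrt{n})$ is admissible only when $r \le \rho$, that is, when $\lambda \le \rho\sigma\sqrt{n}$. This is why the tail bound is stated for a restricted range of $\lambda$, and why the parameter $\rho$ has to be carried through the analysis.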

Second, we consider the noiseless case when all probability mass in $\mathcal{P}$ is concentrated at 0. Our result
holds more generally, when $\mathcal{P}$ has at least one point mass: a point $x \in \mathbb{R}$ such that $\mathcal{P}(x) > 0$.

Corollary 4.5. Consider the Noisy Lipschitz MAB problem such that the noise distribution $\mathcal{P}$ has at least
one point mass. Then the problem admits a confidence radius which is $(\beta, c)$-good for any given $\beta > 0$ and
a constant $c = c(\beta, \mathcal{P})$. The corresponding low-regret guarantees follow via Theorem 4.2.

Proof Sketch. Let $S = \arg\max_x \mathcal{P}(x)$ be the set of all points with the largest point mass $p = \max_x \mathcal{P}(x)$,
and let $q = \max_{x :\, \mathcal{P}(x) < p} \mathcal{P}(x)$ be the second largest point mass. Then $n = \Theta(\log t)$ samples suffice to
ensure that with high probability each node in $S$ will get at least $n(p + q)/2$ hits whereas any other node
will get less, which exactly locates all points in $S$. We use confidence radius rt{v) = Q{iph){j)^*^^'^. □
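The counting idea in this proof sketch can be illustrated with a short simulation. The Python sketch below assumes a hypothetical noise distribution with a single point mass at 0 (probability `p_mass`) and an otherwise continuous part, so that the most frequent observed reward is exactly $\mu$. All names, the noise shape, and the parameter values are assumptions made for illustration.

```python
import random
from collections import Counter


def sample_rewards(mu, n, p_mass, rng):
    """Draw n rewards mu + noise; noise is 0 w.p. p_mass, else continuous."""
    rewards = []
    for _ in range(n):
        if rng.random() < p_mass:
            noise = 0.0                     # the point mass
        else:
            noise = rng.uniform(-1.0, 1.0)  # continuous part, no other atoms
        rewards.append(mu + noise)
    return rewards


def locate_mean(rewards):
    """Return the most frequent observed value (mu shifted by the atom at 0)."""
    value, _count = Counter(rewards).most_common(1)[0]
    return value


rng = random.Random(0)
mu = 0.37
estimate = locate_mean(sample_rewards(mu, n=50, p_mass=0.5, rng=rng))
```

With 50 samples and a point mass of 0.5, the value $\mu$ is observed repeatedly while continuous noise values almost surely never repeat, so the estimate recovers $\mu$ exactly rather than up to a Chernoff-style error term.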

Third, we consider noise distributions with a "special region" which can be located using a few samples.
This may be a more efficient way to estimate $\mu(v)$ than using the standard Chernoff-style tail bounds.
Moreover, in our examples $\mathcal{P}$ may be heavy-tailed, so that Chernoff-style bounds do not hold.

Corollary 4.6. Consider the Noisy Lipschitz MAB problem with noise distribution $\mathcal{P}$. Suppose $\mathcal{P}$ has a
density $f(x)$ which is symmetric around 0 and non-increasing for $x > 0$. Assume one of the following:
(a) $f(x)$ has a sharp peak: $f(x) = \Theta(|x|^{-\alpha})$ for all small enough $|x|$, where $\alpha \in (0, 1)$.
(b) $f(x)$ is piecewise continuous on $(0, \infty)$ with at least one jump.
Then for some constant $c_{\mathcal{P}}$ that depends only on $\mathcal{P}$ the problem admits a $(\beta, c_{\mathcal{P}})$-good confidence radius,
where (a) $\beta = 1 - \alpha$, (b) $\beta = 1$. The corresponding low-regret guarantees follow via Theorem 4.2.

^10 Recall that throughout the paper the payoff distribution of each strategy $x$ has support $S(x) \subset [0, 1]$. In this
subsection, by a slight abuse of notation, we do not make this assumption.

Proof Sketch. For part (a), note that for any $x > 0$ in a neighborhood of 0 we

Therefore $n = O(x^{\alpha - 1} \log t)$ samples suffice to separate with high probability any length-$x$ sub-interval
of $(-x, x)$ from any length-$x$ sub-interval of $(2x, \infty)$. It follows that using $n$ samples we can approximate
the mean reward up to $\pm O(x)$. Accordingly, we set $r_t(v) = \Theta(i_{ph}/n_t(v))^{1/(1-\alpha)}$.

For part (b), let $x_0$ be the smallest positive point where the density $f$ has a jump. Then by continuity there
exists some $\epsilon > 0$ such that $\inf_{x \in (x_0 - \epsilon,\, x_0)} f(x) > \sup_{x \in (x_0,\, x_0 + \epsilon)} f(x)$. Therefore for any $x < \epsilon$ using
$n = O(\frac{1}{x} \log t)$ samples suffices to separate with high probability any length-$x$ sub-interval of $(0, x_0)$ from
any length-$x$ sub-interval of $(x_0, \infty)$. It follows that using $n$ samples we can approximate the mean reward
up to $\pm O(x)$. Accordingly, we set $r_t(v) = \Theta(i_{ph}/n_t(v))$. □

4.2 What if the maximal expected reward is 1? 

We elaborate the algorithm from Section 2 so that it still satisfies the guarantee established there and performs
much better if the maximal expected reward is 1.

Definition 4.7. Consider the Lipschitz MAB problem. Call an algorithm $\beta$-good if there exists an absolute
constant $c_0$ such that for any problem instance of $c$-zooming dimension $d$ it has the properties (a) and (b) in
Theorem 4.2. Call a confidence radius $\beta$-good if it is $(\beta, c_0)$-good for some absolute constant $c_0$.

Theorem 4.8. Consider the standard Lipschitz MAB problem. There is an algorithm A which is 2-good in 
general, and 1-good when the maximal expected reward is 1. 

The key ingredient here is a refined version of the confidence radius which is much sharper than the one
from Section 2 when the sample average is close to 1. For phase $i_{ph}$, we define

$$r_t(v) := \frac{\alpha}{n_t(v)} + \sqrt{\frac{\alpha\,(1 - \mu_t(v))}{n_t(v)}} \quad \text{for some } \alpha = \Theta(i_{ph}). \qquad (20)$$

In order to analyze (20) we need to establish the following Chernoff-style bound which, to the best of
our knowledge, has not appeared in the literature:

Lemma 4.9. Consider $n$ i.i.d. random variables $X_1, \ldots, X_n$ on $[0, 1]$. Let $\mu$ be their mean, and let $\bar{X}$ be
their average. Then for any $\alpha > 0$ the following holds:

$$\Pr\left[\, |\bar{X} - \mu| \le r(\alpha, \bar{X}) \le 3\, r(\alpha, \mu) \,\right] \ge 1 - e^{-\Omega(\alpha)}, \quad \text{where } r(\alpha, x) = \frac{\alpha}{n} + \sqrt{\frac{\alpha x}{n}}.$$

Proof. We will use two well-known Chernoff bounds, (CB1) and (CB2) (e.g. see p. 64 of [20]).



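we have

$$|\bar{X} - \mu| \le \ldots \le \sqrt{\alpha \bar{X}/n} \le r(\alpha, \bar{X}) \le 1.5\, r(\alpha, \mu).$$

Now suppose $\mu < \ldots$ Then using (CB2) with $a = \ldots$, we obtain that with probability at least $1 - 2^{-\Omega(\alpha)}$
we have $\bar{X} < \ldots$, and therefore

$$|\bar{X} - \mu| \le \ldots \le r(\alpha, \bar{X}) \le (1 + \sqrt{2})\,\ldots \le 3\, r(\alpha, \mu). \qquad \square$$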

Proof of Theorem 4.8. Let us fix a strategy $v$ and time $t$. Let us use Lemma 4.9 with $n = n_t(v)$ and
$\alpha = \Theta(i_{ph})$ as in (20), setting each random variable $X_i$ equal to 1 minus the reward from the $i$-th time
strategy $v$ is played in the current phase. Then $\mu = 1 - \mu(v)$ and $\bar{X} = 1 - \mu_t(v)$, so the Lemma says that

$$\Pr\left[\, |\mu_t(v) - \mu(v)| \le r_t(v) \le 3\left(\frac{\alpha}{n_t(v)} + \sqrt{\frac{\alpha\,(1 - \mu(v))}{n_t(v)}}\right) \right] \ge 1 - 2^{-\Omega(\alpha)}. \qquad (21)$$



Note that (20) is indeed a confidence radius with respect to $\mu_t(v)$: property (i) in Definition 4.1 holds
by (21), and it is easy to check that property (ii) holds, too. It is easy to see that (20) is a 2-good confidence
radius. It remains to show that it is 1-good when the maximal reward is 1; this is where we use the upper
bound on $r_t(v)$ in (21). It suffices to prove the following claim:



If the maximal reward is 1 and $\Delta(v) \le 4\, r_t(v)$ then $n_t(v) \le O(\log t)\, \Delta(v)^{-1}$.

Indeed, let $n = n_t(v)$ and $\Delta = \Delta(v)$, and suppose that the maximal reward is 1 and $\Delta(v) \le 4\, r_t(v)$.
Then by (21) we have $\Delta \le 4\, r_t(v) \le \frac{\alpha}{n} + \sqrt{\alpha\Delta/n}$ for some $\alpha = O(\log t)$. Now there are two cases. If
$\frac{\alpha}{n} < \Delta/2$ then $\sqrt{\alpha\Delta/n} \ge \Delta - \frac{\alpha}{n} \ge \Delta(v)/2$, which implies the desired inequality. Else we simply have
$n \le O(\alpha/\Delta)$. Claim proved. □
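The two-case argument above can be checked numerically. The Python sketch below implements the function $r(\alpha, x) = \alpha/n + \sqrt{\alpha x/n}$ from Lemma 4.9 and verifies on a small grid that $\Delta \le \alpha/n + \sqrt{\alpha\Delta/n}$ forces $n \le 4\alpha/\Delta$. The constant 4, the grid, and the function names are assumptions chosen for this illustration; the paper only claims $n \le O(\alpha/\Delta)$.

```python
import math


def refined_radius(alpha: float, n: int, x: float) -> float:
    """r(alpha, x) = alpha/n + sqrt(alpha * x / n); assumes n >= 1, x >= 0."""
    return alpha / n + math.sqrt(alpha * x / n)


def claim_holds(alpha: float, n: int, delta: float) -> bool:
    """If delta <= alpha/n + sqrt(alpha*delta/n), then n <= 4*alpha/delta.

    Returns True when the premise fails (nothing to check) or the
    conclusion holds.
    """
    if delta > refined_radius(alpha, n, delta):
        return True
    return n <= 4.0 * alpha / delta


def check_claim() -> bool:
    """Test the two-case argument on a small grid of (alpha, n, delta)."""
    return all(
        claim_holds(alpha, n, k / 50.0)
        for alpha in (1.0, 5.0, 20.0)
        for n in range(1, 500)
        for k in range(1, 51)
    )
```

Note how the radius behaves at the two extremes: for `x = 0` (sample average equal to 1 after the "1 minus reward" transformation) it decays like `alpha / n`, whereas for `x` bounded away from 0 the square-root term dominates and the decay is only `sqrt(alpha / n)`.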



4.3 Example: expected reward = distance to the target 

We consider a version of the Lipschitz MAB problem where the expected reward of a given strategy is equal 
to its distance to some target set which is not revealed to the algorithm. 

Definition 4.10. The Target MAB problem on a metric space $(L, X)$ with a target set $S \subset X$ is the standard
Lipschitz MAB problem on $(L, X)$ with payoff function $\mu(u) = 1 - L(u, S)$.

Remark. It is a well-known fact that $L(u, v) \ge L(u, S) - L(v, S)$ for any $u, v \in X$ and any set $S \subset X$.
Therefore the payoff function $\mu$ in Definition 4.10 is Lipschitz on $(L, X)$.

Note that in the Target MAB problem the maximal reward is 1, so we can take advantage of the zooming
algorithm $A$ from Theorem 4.8. Recall that $R_A(t) \le \tilde{O}(c\, t^{1 - 1/(d+1)})$ where $d$ is the $c$-zooming dimension.
In this example zooming dimension is about covering $B(S, r)$ with sets of diameter $\Theta(r)$: it is the smallest
$d$ such that for each $r > 0$ the ball $B(S, r)$ can be covered with $c\, r^{-d}$ sets of diameter $\le r/8$.

Let us refine this bound for metric spaces of finite doubling dimension. In particular, we show that for a
finite target set the zooming algorithm from Theorem 4.8 achieves poly-logarithmic regret.

Theorem 4.11. Consider the Target MAB problem on a metric space of finite doubling dimension $d^*$. Let
$A$ be the zooming algorithm from Theorem 4.8. Then

$$R_A(t) \le (c\, 2^{O(d^*)} \log^2 t)\; t^{1 - 1/(1+d)} \quad \text{for all } t, \qquad (22)$$

where $d$ is the $c$-covering dimension of the target set $S$.






Proof. By Theorem 4.8 it suffices to prove that the $K$-zooming dimension of the pair $(L, \mu)$ is at most $d$,
for some $K = c\, 2^{O(d^*)}$. In other words, it suffices to cover the set $S_\delta = \{u \in X : \Delta(u) \le \delta\}$ with $K \delta^{-d}$
sets of diameter $\le \delta/16$, for any given $\delta > 0$.

Fix $\delta > 0$ and note that $\Delta(u) = L(u, S)$. Note that the set $S$ can be covered with $c\, \delta^{-d}$ sets $\{C_i\}_i$ of
diameter $\le \delta$. It follows that the set $S_\delta$ can be covered with $c\, \delta^{-d}$ sets $\{B(C_i, \delta)\}_i$ of diameter $\le 3\delta$.
Moreover, each set $B(C_i, \delta)$ can be covered with $2^{O(d^*)}$ sets of diameter $\le \delta/16$. □

Remarks. This theorem is useful when d < d* , i.e. when the target set is a low-dimensional subset of the 
metric space. Recall that the zooming algorithm is self-tuning: it does not need to know d* and d, and in 
fact it does not even need to know that it is presented with an instance of the Target MAB problem! 
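The covering argument behind Theorem 4.11 is easy to see on the unit interval. The Python sketch below takes a hypothetical finite target set on $[0, 1]$ with the usual distance, builds the near-optimal set $\{u : L(u, S) \le \delta\}$, and counts how many intervals of diameter $\delta/16$ are needed to cover it. The target set, the scales, and the function names are assumptions made only for this illustration.

```python
import math


def near_optimal_intervals(targets, delta):
    """Merged intervals forming {u in [0,1] : dist(u, targets) <= delta}."""
    raw = sorted((max(0.0, s - delta), min(1.0, s + delta)) for s in targets)
    merged = [list(raw[0])]
    for lo, hi in raw[1:]:
        if lo <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], hi)  # overlap: extend
        else:
            merged.append([lo, hi])
    return merged


def cover_size(targets, delta):
    """Number of intervals of diameter delta/16 used to cover the set."""
    piece = delta / 16.0
    return sum(
        max(1, math.ceil((hi - lo) / piece))
        for lo, hi in near_optimal_intervals(targets, delta)
    )


targets = [0.2, 0.5, 0.9]  # hypothetical finite target set
sizes = [cover_size(targets, 2.0 ** -k) for k in range(4, 12)]
```

The count stays bounded by a constant multiple of the number of targets as $\delta$ shrinks, which corresponds to covering dimension $d = 0$ and hence to the poly-logarithmic regime of the regret bound.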

We note in passing that it is very easy to extend Theorem 4.11 to a setting in which the strategy set $Y$
is a proper subset of the metric space $(L, X)$ and does not contain the target set $S$. If $L(Y, S) = 0$ then the
guarantee (22) holds as is. If $L(Y, S) > 0$ then the following guarantee holds:

$$R_A(t) \le (c\, 2^{O(d^*)} \log^2 t)\; t^{1 - 1/(2+d)} \quad \text{for all } t,$$

where $d$ is the $c$-covering dimension of the set $B(S, r)$, $r = L(Y, S)$.

4.4 The Lipschitz MAB problem under relaxed assumptions 

The analysis in Section 2 does not require all the assumptions in the Lipschitz MAB problem. In fact, it
never uses the triangle inequality, and applies the Lipschitz condition (1) only if (essentially) one of the two
strategies in (1) is optimal. Let us formulate our results under the properly relaxed assumptions. In what
follows, the zooming algorithm will refer to the algorithm in Theorem 4.8.

Theorem 4.12. Consider a version of the Lipschitz MAB problem on $(L, X)$ in which the similarity metric
is not required to satisfy the triangle inequality,^11 and the Lipschitz condition (1) is replaced by

$$(\forall u \in X) \quad \Delta(u) \le L(u, v^*) \quad \text{for some } v^* \in \arg\max_{v \in X} \mu(v). \qquad (23)$$

More generally, if such a node $v^*$ does not exist, assume

$$(\forall \epsilon > 0)\ (\exists v^* \in X)\ (\forall u \in X) \quad \Delta(u) \le L(u, v^*) + \epsilon. \qquad (24)$$

Then the guarantees for the zooming algorithm in Theorem 4.8 still hold.

We apply this theorem to a generalization of the Target MAB problem in which $\mu(u) = f(L(u, S))$
for some known non-decreasing shape function $f : [0, 1] \to [0, 1]$. Let us define a quasi-distance $L_f$ by
$L_f(u, v) = f(L(u, v)) - f(0)$. It is easy to see that $L_f$ satisfies (23). Indeed, fix any $u^* \in S$. Then

$$\Delta(u) = f(L(u, S)) - f(0) \le f(L(u, u^*)) - f(0) = L_f(u, u^*) \quad (\forall u \in X).$$

Thus we can use the zooming algorithm on the quasi-distance $L_f$ and enjoy the guarantees in Theorem 4.8.
Below we refine these guarantees for several examples.
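The quasi-distance $L_f$ can be made concrete with a small Python sketch. It uses the real line with the usual distance, a hypothetical two-point target set, and the shape function $f(x) = x^2$ (that is, $x^{1/\alpha}$ with $\alpha = 1/2$); all of these choices, and the helper names, are assumptions for illustration. The sketch checks condition (23) on a grid and exhibits a triple of points where $L_f$ violates the triangle inequality.

```python
def make_quasi_distance(f, L):
    """Return L_f(u, v) = f(L(u, v)) - f(0) for a shape function f."""
    return lambda u, v: f(L(u, v)) - f(0.0)


def line_metric(u, v):
    """Ordinary distance on the real line (an assumed example metric)."""
    return abs(u - v)


def dist_to_set(L, u, S):
    """Distance from a point u to a finite set S."""
    return min(L(u, s) for s in S)


f = lambda x: x ** 2  # shape function x^(1/alpha) with alpha = 1/2
L_f = make_quasi_distance(f, line_metric)

S = [0.0, 1.0]  # hypothetical target set
points = [k / 20.0 for k in range(21)]

# Condition (23): f(L(u, S)) - f(0) <= L_f(u, u*) for a fixed u* in S.
u_star = S[0]
condition_23 = all(
    f(dist_to_set(line_metric, u, S)) - f(0.0) <= L_f(u, u_star)
    for u in points
)

# The triangle inequality can fail for L_f: 1 > 0.25 + 0.25.
triangle_fails = L_f(0.0, 1.0) > L_f(0.0, 0.5) + L_f(0.5, 1.0)
```

This is why the relaxed assumptions matter: $L_f$ is symmetric and vanishes on the diagonal, but it is not a metric, so the original Lipschitz MAB guarantees would not apply to it directly.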

Our goal here is to provide clean illustrative statements rather than cover the most general setting to
which our refined guarantees apply. Therefore we start with the most concrete example which we formulate
as a theorem, and follow up with some extensions which we list without a proof.

^11 Formally, we require $L$ to be a symmetric function $X \times X \to [0, \infty]$ such that $L(x, x) = 0$ for all $x \in X$. We call such
a function a quasi-distance on $X$.
Theorem 4.13. Consider the Target MAB problem on a metric space $(L, X)$ of finite doubling dimension
$d^*$, with shape function $f(x) = x^{1/\alpha}$, $\alpha > 0$. Let $A$ be the zooming algorithm on $(L_f, X)$. Then

$$R_A(t) \le (c\, 2^{O(d^*)} \log^2 t)\; t^{1 - 1/(1 + \alpha d)} \quad \text{for all } t, \qquad (25)$$

where $d$ is the $c$-covering dimension of the target set $S$.

Proof. Consider the pair $(L_f, \mu)$. Since the maximal reward is 1, by Theorem 4.8 it suffices to prove that
for some $c^* = c\, 2^{O(d^*)}$ the $c^*$-zooming dimension of this pair is at most $\alpha d$. Specifically, for each $\delta > 0$ we
need to cover the set $S_\delta = \{u \in X : \Delta(u) \le \delta\}$ with $c^* \delta^{-\alpha d}$ sets of $L_f$-diameter at most $\delta/16$.

Indeed, since $\Delta(u) = L_f(u, S)$, for each $u \in S_\delta$ we have $L(u, S) \le \delta^\alpha$. Thus $S_\delta \subset B(S, \delta^\alpha)$. Let
$\epsilon = 16^{-\alpha}$. As in the proof of Theorem 4.11 we can show that $S_\delta$ can be covered by $c\, \epsilon^{-O(d^*)}\, \delta^{-\alpha d}$ sets of
diameter $\epsilon\, \delta^\alpha$. Each of these sets has $L_f$-diameter at most $f(\epsilon\, \delta^\alpha)$, which is at most $\delta/16$. □
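The two scalings used in this proof follow directly from the shape function $f(x) = x^{1/\alpha}$. The short computation below spells them out; it is only a restatement of the steps above, with $\epsilon = 16^{-\alpha}$ as in the proof.

```latex
\begin{align*}
L_f(u, S) \le \delta
  &\iff L(u, S)^{1/\alpha} \le \delta
   \iff L(u, S) \le \delta^{\alpha}, \\
f\left(\epsilon\, \delta^{\alpha}\right)
  &= \left(16^{-\alpha}\, \delta^{\alpha}\right)^{1/\alpha}
   = \frac{\delta}{16}.
\end{align*}
```

So a set of $L$-diameter $\epsilon\,\delta^\alpha$ has $L_f$-diameter at most $\delta/16$, and the number of such sets scales as $\delta^{-\alpha d}$, which is where the exponent $\alpha d$ in the zooming dimension comes from.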

Remarks. This theorem includes Theorem 4.11 as a special case $f(x) = x$. Like the latter, this theorem is
useful when the target set is a low-dimensional subset of the metric space.

We consider extensions to more general shape functions and to strategy sets which do not contain S: 

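• Suppose the shape function $f$ satisfies the following constraints for some constants $\alpha \ge \alpha^* > 0$:

$$\forall x \in (0, 1] \quad g(x) \ge x^{1/\alpha} \ \text{ and } \ g(x) \ge 2^{1/\alpha^*} g(x/2),$$

where $g(x) = f(x) - f(0)$. Then for $\beta = 1 + \mathbf{1}_{\{f(0) > 0\}}$ we have

$$R_A(t) \le (c\, 2^{O(d^* \alpha^*/\alpha)} \log^2 t)\; t^{1 - 1/(\beta + \alpha d)} \quad \text{for all } t.$$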

• Consider the setting in which the strategy set $Y$ is a proper subset of the metric space $(L, X)$ and does
not contain the target set $S$. If $L(u^*, S) = 0$ for some $u^* \in Y$ then the guarantee (25) holds as is. In
general, if we restrict the shape function to $f(x) = c + x^{1/\alpha}$, $\alpha \in (0, 1]$ then

$$R_A(t) \le (c\, 2^{O(d^*)} \log^2 t)\; t^{1 - 1/(2 + d)} \quad \text{for all } t,$$

where $d$ is the $c$-covering dimension of the set $S^* = B(S, r)$, $r = L(Y, S)$. Moreover, one can prove
similar guarantees with $d^*$ being the doubling dimension of an open neighborhood of $S^*$, rather than
that of the entire metric space.

4.5 Heavy-tailed reward distributions 

Consider the Lipschitz MAB problem and let $X_n(v)$ be the reward from the $n$-th trial of strategy $v$. The
current problem formulation restricts $X_n(v)$ to support $[0, 1]$. In this section we remove this restriction.
In fact, it suffices to assume that $X_n(v)$ is an independent random variable with mean $\mu(v) \in [0, 1]$ and a
uniformly bounded absolute third moment. Note that different trials of the same strategy $\{X_n(v) : n \in \mathbb{N}\}$
do not need to be identically distributed.

Theorem 4.14. Consider the standard Lipschitz MAB problem. Let $X_n(v)$ be the reward from the $n$-th trial
of strategy $v$. Assume that each $X_n(v)$ is an independent random variable with mean $\mu(v) \in [0, 1]$ and
furthermore that $E[\,|X_n(v)|^3\,] \le \rho$ for some constant $\rho$. Then there is an algorithm $A$ such that if for some
$c$ the problem instance has $c$-zooming dimension $d$ then

$$R_A(t) \le a(t)\, t^{1 - 1/(3d+6)} \quad \text{for all } t, \quad \text{where } a(t) = O(c\, \rho \log t)^{1/(3d+6)}. \qquad (26)$$
The proof relies on the non-uniform Berry-Esseen theorem (e.g. see [21] for a nice survey) which we
use to obtain a tail inequality similar to Claim 2.2: for any $a > 0$

$$\Pr[\, |\mu_t(v) - \mu(v)| > r_t(v) \,] < O(t^{-3a}), \quad \text{where } r_t(v) = \Theta(t^{a})/\sqrt{n_t(v)}. \qquad (27)$$

However, this inequality gives a much higher failure probability than Claim 2.2; in particular, we cannot take
a union bound over all active strategies. Accordingly, we need a more refined version of Theorem 4.2
which is parameterized by the failure probability in (27). In the analysis, instead of the failure events when
the phase is not clean (see Definition 2.1) we need to consider the p-failure events when the tail bound
from (27) is violated by some strategy $v$ such that $\Delta(v) > p$. Then using the technique from Section 2 we
can upper-bound $R_A(T)$ in terms of $T$, $d$, $p$ and $a$ and choose the optimal values for $p$ and $a$.